Presentation is loading. Please wait.

Presentation is loading. Please wait.

Variations of the Star Schema Benchmark to Test the Effects of Data Skew on Query Performance TILMANN RABL, MEIKEL POESS, HANS- ARNO JACOBSEN, PATRICK.

Similar presentations


Presentation on theme: "Variations of the Star Schema Benchmark to Test the Effects of Data Skew on Query Performance TILMANN RABL, MEIKEL POESS, HANS- ARNO JACOBSEN, PATRICK."— Presentation transcript:

1 Variations of the Star Schema Benchmark to Test the Effects of Data Skew on Query Performance TILMANN RABL, MEIKEL POESS, HANS- ARNO JACOBSEN, PATRICK AND ELIZABETH O’NEIL MIDDLEWARE SYSTEMS RESEARCH GROUP MSRG.ORG ICPE 2013, PRAGUE, 24/04/2013

2 Real Life Data is Distributed Uniformly… ◦Customers zip codes typically clustered around metropolitan areas ◦Seasonal items (lawn mowers, snow shovels, …) sold mostly during specific periods ◦US retail sales: ◦peak during Holiday Season ◦December sales are 2x of January sales RABL, POESS, JACOBSEN, O'NEIL, O'NEIL - SSB SKEW VARIATIONS 2 Source: US Census Data Well, Not Really

3 Student Seminar Signup Distribution RABL, POESS, JACOBSEN, O'NEIL, O'NEIL - SSB SKEW VARIATIONS 3

4 How Can Skew Effect Database Systems? Data placement ◦Partitioning ◦Indexing Data structures ◦Tree balance ◦Bucket fill ratio ◦Histograms Optimizer  finding the optimal query plan ◦Index vs. non-index driven plans ◦Hash join vs. merge join ◦Hash group by vs. sort group by 4 RABL, POESS, JACOBSEN, O'NEIL, O'NEIL - SSB SKEW VARIATIONS

5 Agenda Data Skew in Current Benchmarks Star Schema Benchmark (SSB) Parallel Data Generation Framework (PDGF) Introducing Skew in SSB 5 RABL, POESS, JACOBSEN, O'NEIL, O'NEIL - SSB SKEW VARIATIONS

6 Data Skew in Benchmarks TPC-D (1994-1999): only uniform data ◦SIGMOD 1997 - “Successor of TPC-D should include data skew” ◦No effect until … TPC-DS (released 2012) ◦Contains comparability zones ◦Not fully utilized TPC-D/H variations ◦Chaudhuri and Narayasa: Zipfian distribution on all columns ◦Crolotte and Ghazal: comparability zones Still lots of open potential 6 RABL, POESS, JACOBSEN, O'NEIL, O'NEIL - SSB SKEW VARIATIONS

7 Star Schema Benchmark I Star schema version of TPC-H ◦Merged Order and Lineitem ◦Date dimension ◦Dropped Partsupp ◦Selectivity hierarchies ◦C_City  C_Nation  C_Region ◦… RABL, POESS, JACOBSEN, O'NEIL, O'NEIL - SSB SKEW VARIATIONS 7

8 Star Schema Benchmark II Completely new set of queries 4 flights of 3-4 queries ◦Designed for functional coverage and selectivity coverage ◦Drill down in dimension hierarchies ◦Predefined selectivity RABL, POESS, JACOBSEN, O'NEIL, O'NEIL - SSB SKEW VARIATIONS 8 select sum(lo_extendedprice*lo_discount) as revenue from lineorder, date where lo_orderdate = d_datekey and d_year = 1993 and lo_discount between 1 and 3 and lo_quantity < 25; select sum(lo_extendedprice*lo_discount) as revenue from lineorder, date where lo_orderdate = d_datekey and d_yearmonthnum = 199301 and lo_discount between 1 and 3 and lo_quantity between 26 and 35; Drilldown Q1.1 Q1.2

9 Parallel Data Generation Framework Generic data generation framework Relational model ◦Schema specified in configuration file ◦Post-processing stage for alternative representations Repeatable computation ◦Based on XORSHIFT random number generators ◦Hierarchical seeding strategy RABL, POESS, JACOBSEN, O'NEIL, O'NEIL - SSB SKEW VARIATIONS 9 Frank, Poess, and Rabl: Efficient Update Data Generation for DBMS Benchmarks. ICPE '12. Rabl and Poess: Parallel Data Generation for Performance Analysis of Large, Complex RDBMS. DBTest '11. Poess, Rabl, Frank, and Danisch: A PDGF Implementation for TPC-H. TPCTC '11. Rabl, Frank, Sergieh, and Kosch: A Data Generator for Cloud-Scale Benchmarking. TPCTC '10.

10 Configuring PDGF Schema configuration Relational model ◦Tables, fields Properties ◦Table size, characters, … Generators ◦Simple generators ◦Metagenerators Update definition ◦Insert, update, delete ◦Generated as change data capture RABL, POESS, JACOBSEN, O'NEIL, O'NEIL - SSB SKEW VARIATIONS 10 ${S} <field name="S_SUPPKEY" size="" type="NUMERIC“ primary="true" unique="true"> 0 true 9 Supplier [..] PDGF XML DB

11 Opportunities to Inject Data Skew in Foreign key relations ◦E.g., L_PARTKEY One fact table measures ◦E.g., L_Quantity Single dimension hierarchy ◦E.g., P_Brand → P_Category → P_Mfgr Multiple dimension hierarchies ◦E.g., City → Nation in Supplier and Customer Experimental methodology ◦One experiment series for each of the above ◦Comparison to original SSB ◦Comparison of index-forced, non-index, and automatic optimizer mode ◦SSB scale factor 100 (100 GB), x86 server 11 RABL, POESS, JACOBSEN, O'NEIL, O'NEIL - SSB SKEW VARIATIONS

12 Skew in Foreign Key Relations Very realistic Easy to implement in PDGF ◦Just add a distribution to the reference But! Dimension attributes uniformly distributed Dimension keys uncorrelated to dimension attributes  Very limited effect on selectivity Focus on attributes in selectivity predicates 12 RABL, POESS, JACOBSEN, O'NEIL, O'NEIL - SSB SKEW VARIATIONS

13 Lo_Quantity distribution ◦Values range between 0 and 50 ◦Originally uniform distribution with: ◦P(X=x)=0.02 ◦Coefficient of variation of 0.00000557 ◦Proposed skewed distribution with: ◦ Query 1.1 ◦lo_quantity < x, x ∈ [2, 51] Results ◦Switches too early to non-index plan ◦Switches too late to non-index plan ◦Optimizer agnostic to distribution Skew in Fact Table Measure – Lo_Quantity 13 RABL, POESS, JACOBSEN, O'NEIL, O'NEIL - SSB SKEW VARIATIONS

14 Skew in Single Dimension Hierarchy - Part P_Category distribution ◦Uniform P(X=x)=0.04 ◦Skewed P(X=x)= 0.01 - 48.36 ◦Probabilities explicitly defined Query 2.1 ◦Restrictions on two dimensions Results uniform case ◦Index driven superior ◦Optimizer chooses non-index driven Results skewed case ◦Switches too early to non-index plan RABL, POESS, JACOBSEN, O'NEIL, O'NEIL - SSB SKEW VARIATIONS 14

15 Skewed S_City & C_City ◦Probabilites exponentially distributed Query 3.3 ◦Restrictions on 3 dimensions ◦Variation on Supplier and Customer city Results uniform and skewed cases ◦Automatic plan performs best ◦Cross over between automatic uniform and skewed too late Skew in Multiple Dimension Hierarchies – S_City & C_City RABL, POESS, JACOBSEN, O'NEIL, O'NEIL - SSB SKEW VARIATIONS 15 Join CardinalityElapsed Time

16 Conclusion & Future Work PDGF implementation of SSB Introduction of skew in SSB Extensive performance analysis ◦Several interesting optimizer effects ◦Performance impact of skew Future Work Further analysis on impact of skew Skew in query generation Complete suite for testing skew effects 16 RABL, POESS, JACOBSEN, O'NEIL, O'NEIL - SSB SKEW VARIATIONS

17 Thanks Questions? Download and try PDGF: http://www.paralleldatageneration.org (scripts used in the study available on website above) RABL, POESS, JACOBSEN, O'NEIL, O'NEIL - SSB SKEW VARIATIONS 17

18 Back-up Slides RABL, POESS, JACOBSEN, O'NEIL, O'NEIL - SSB SKEW VARIATIONS 18

19 Configuring PDGF Generation Generation configuration Defines the output ◦Scheduling ◦Data format ◦Sorting ◦File name and location Post processing ◦Filtering of values ◦Merging of tables ◦Splitting of tables ◦Templates (e.g. XML / queries) RABL, POESS, JACOBSEN, O'NEIL, O'NEIL - SSB SKEW VARIATIONS 19 [..] <!-- int y = (fields [0]. getPlainValue ()).intValue (); int d = (fields [1]. getPlainValue ()).intValue (); int q = (fields [2]. getPlainValue ()).intValue (); String n = pdgf.util.Constants.DEFAULT_LINESEPARATOR; buffer.append("-- Q1.1" + n); buffer.append("select sum(lo_extendedprice *"); buffer.append(" lo_discount) as revenue" + n); buffer.append(“ from lineorder, date" + n); buffer.append(“ where lo_orderdate = d_datekey" + n); buffer.append(“ and d_year = " + y + n); buffer.append(“ and lo_disc between " + (d - 1)); buffer.append(“ and " + (d + 1) + n); buffer.append(“ and lo_quantity < " + q + ";" + n); -->


Download ppt "Variations of the Star Schema Benchmark to Test the Effects of Data Skew on Query Performance TILMANN RABL, MEIKEL POESS, HANS- ARNO JACOBSEN, PATRICK."

Similar presentations


Ads by Google