Rapid Development of Data Generators Using Meta Generators in PDGF Tilmann Rabl, Meikel Poess, Manuel Danisch, Hans-Arno Jacobsen DBTest 2013, June 24,


1 Rapid Development of Data Generators Using Meta Generators in PDGF Tilmann Rabl, Meikel Poess, Manuel Danisch, Hans-Arno Jacobsen DBTest 2013, June 24, New York City MIDDLEWARE SYSTEMS RESEARCH GROUP MSRG.ORG

2 DBMS Benchmarking is Increasingly Complex
Data volumes are skyrocketing
- Enterprise data warehouses double every three years
- Many enterprise data warehouses are petabytes in size
Systems are becoming increasingly complex
- Large number of processor cores
- Single systems (SMP) with a high number of cores (80 on commodity hardware, 2048 on specialized hardware)
- Multi-node systems (sky is the limit)
- Large memory
- Dell released a TPC-H benchmark with 15 TB of main memory on 64 systems
How to challenge these systems?

3 Benchmarks are increasingly complex
More tables, columns
More relationships, dependencies, data types, …
How to build these benchmarks?
Parallel Data Generation Framework to the rescue!

4 Parallel Data Generation Framework
Generic data generation framework
Relational model
- Schema specified in configuration file
- Post-processing stage for alternative representations
Repeatable computation
- Based on XORSHIFT random number generators
- Hierarchical seeding strategy
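The combination of xorshift generators and hierarchical seeding can be illustrated with a minimal Python sketch. It is not PDGF's implementation: the names child_seed and field_value, the seed-mixing constant, and the project/table/column/row hierarchy are assumptions made for illustration. The point it shows is that each cell's value depends only on its position in the seed hierarchy, so any row can be regenerated independently and repeatably.

```python
MASK64 = (1 << 64) - 1

def xorshift64(state):
    """One step of a 64-bit xorshift generator (commonly used shifts 13, 7, 17)."""
    state &= MASK64
    state ^= (state << 13) & MASK64
    state ^= state >> 7
    state ^= (state << 17) & MASK64
    return state

def child_seed(parent_seed, index):
    """Derive the seed of the index-th child from its parent's seed (assumed scheme)."""
    # Mix the index into the parent seed; the odd multiplier keeps indices distinct.
    mixed = (parent_seed + (index + 1) * 0x9E3779B97F4A7C15) & MASK64
    # xorshift maps 0 to 0, so a zero state is nudged to 1 before scrambling.
    return xorshift64(xorshift64(mixed or 1))

def field_value(project_seed, table, column, row, lo, hi):
    """Value of one cell in [lo, hi], computed from the seed hierarchy alone."""
    seed = child_seed(child_seed(child_seed(project_seed, table), column), row)
    return lo + xorshift64(seed) % (hi - lo + 1)

# The same coordinates always yield the same value, in any order, on any worker.
first = field_value(12345, table=0, column=1, row=42, lo=0, hi=120)
again = field_value(12345, table=0, column=1, row=42, lo=0, hi=120)
```

Because no generator state is shared between rows, workers can be handed arbitrary row ranges without coordinating on random number streams.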

5 Repeatable Data Generation

6 PDGF Architecture
Controller
- Initialization
Meta Scheduler
- Inter-node scheduling
Scheduler
- Inter-thread scheduling
Worker
- Blockwise data generation
Update Black Box
- Coordination of data updates
Seeding System
- Random sequence adaptation
Generators
- Value generation
Output system
- Data formatting
To generate data for a schema the user defines:
- Schema XML file
- Defines relational schema
- Generation XML file
- Defines output format (CSV, XML, merging tables)
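The division of labor between inter-node scheduling, inter-thread scheduling, and blockwise workers could be pictured roughly as below. This is a Python sketch under stated assumptions, not PDGF's own code: split_range, blocks, and plan are hypothetical names, and the even, contiguous partitioning of rows is assumed rather than taken from the slides.

```python
def split_range(start, stop, parts):
    """Split [start, stop) into `parts` contiguous, nearly equal half-open ranges."""
    base, extra = divmod(stop - start, parts)
    ranges = []
    lo = start
    for i in range(parts):
        hi = lo + base + (1 if i < extra else 0)
        ranges.append((lo, hi))
        lo = hi
    return ranges

def blocks(start, stop, block_size):
    """Cut [start, stop) into blocks of at most block_size rows."""
    return [(lo, min(lo + block_size, stop)) for lo in range(start, stop, block_size)]

def plan(rows, nodes, threads, block_size):
    """Map (node, thread) to the list of row blocks that worker would generate."""
    work = {}
    # Inter-node split first (meta scheduler), then inter-thread split (scheduler).
    for n, (n_lo, n_hi) in enumerate(split_range(0, rows, nodes)):
        for t, (t_lo, t_hi) in enumerate(split_range(n_lo, n_hi, threads)):
            work[(n, t)] = blocks(t_lo, t_hi, block_size)
    return work

work = plan(rows=1000, nodes=2, threads=2, block_size=100)
```

Such a static split only works because generation is repeatable per row: a worker needs the row numbers of its blocks and nothing else.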

7 Configuring PDGF
Schema configuration
- Data model
Relational model
- Tables, fields
Properties
- Table size, characters, …
Generators
- Base generators
- Meta generators
Update definition
- Insert, update, delete
- Generated as change data capture
Example configuration (fragment): ${S} <field name="S_SUPPKEY" size="" type="NUMERIC" primary="true" unique="true"> 0 true 9 Supplier [..]

8 Base Generators in PDGF
DictList generator
- Random line from file
Long generator
- Random long in interval
Others
- StaticValue
- Double
- Date
- String
- Text
- …
Example configuration values (fragment): 10000 java.sql.types.VARCHAR 100 dicts/names.dict java.sql.types.NUMERIC 0 120

9 Null Generator
Add NULL logic to every generator?
- Could easily be implemented in higher class
- Adds to the configuration file
- Reduces performance (every time)
Higher order generator NullGenerator
- Only used if added to the schema
- Can be added to any generator
Example configuration values (fragment): java.sql.types.NUMERIC 0.05 0 120
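The idea of a higher-order NULL generator can be sketched in a few lines of Python. This is an illustration only, not PDGF's API: long_generator and null_generator are hypothetical names, and reading the example values as a NULL probability of 0.05 wrapped around a long in [0, 120] is an assumption.

```python
import random

def long_generator(lo, hi):
    """Base generator: a random integer in [lo, hi]."""
    def generate(rng):
        return rng.randint(lo, hi)
    return generate

def null_generator(probability, inner):
    """Meta generator: returns None with the given probability, else delegates."""
    def generate(rng):
        if rng.random() < probability:
            return None          # the inner generator is not invoked at all
        return inner(rng)
    return generate

# Assumed reading of the example: 5% NULLs around a long between 0 and 120.
age = null_generator(0.05, long_generator(0, 120))
rng = random.Random(7)
values = [age(rng) for _ in range(10_000)]
null_share = values.count(None) / len(values)
```

Generators that never produce NULL pay nothing for this logic, since the wrapper only exists where the schema adds it.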

10 Meta Generators
Control flow and post-processing generators
- Null generator controls flow
Post-processing
- FormattedNumberGenerator
- PaddingGenerator
- UpperLowerCaseGenerator
- PrePostfixGenerator
- FormulaGenerator
Flow control
- ProbabilityGenerator
- SequentialGenerator
- IfGenerator
- SwitchGenerator
- ReferenceGenerator

11 Post-Processing Example
Phone number for users
- 10s of representations
- PhoneNumberGenerator was too inflexible
Formatted long number
- Long numbers between 10010001 and 9999999999
- Number formatting (%d%d%d) %d%d%d-%d%d%d%d
Example configuration values (fragment): java.sql.types.VARCHAR 30 10010001 9999999999 (%d%d%d) %d%d%d-%d%d%d%d
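One way to read the phone number example is that each %d in the pattern consumes one decimal digit of the generated long. The Python sketch below follows that reading; format_digits is a hypothetical helper, and both the digit-per-placeholder mapping and the left zero-padding are assumptions, not documented PDGF behavior.

```python
def format_digits(number, pattern):
    """Fill each %d placeholder in pattern with one decimal digit of number."""
    slots = pattern.count("%d")
    digits = str(number).zfill(slots)   # assumed: short numbers are left-padded with zeros
    if len(digits) > slots:
        raise ValueError("number has more digits than the pattern has placeholders")
    out = pattern
    for digit in digits:
        out = out.replace("%d", digit, 1)   # fill placeholders left to right
    return out

PHONE = "(%d%d%d) %d%d%d-%d%d%d%d"
print(format_digits(4165551234, PHONE))   # (416) 555-1234
```

A different phone representation then only requires a different pattern string, which is the flexibility a dedicated PhoneNumberGenerator lacked.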

12 Flow Control Example
More elaborate name field
- Name male or female
- 50% chance
- All upper case
- Padded to 100 characters
Sequential generator
- Probability generator
- DictList generator
- UpperLowerCase generator
- Padding generator
Example configuration values (fragment): java.sql.types.VARCHAR 100 dicts/female.dict dicts/male.dict uppercase true
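The name field can be read as a pipeline: pick a dictionary with 50% probability, draw a name, upper-case it, pad it to 100 characters. The following Python sketch mimics that composition. All function names are hypothetical, the in-memory lists stand in for dicts/female.dict and dicts/male.dict, and right-padding with spaces is an assumption.

```python
import random

FEMALE = ["Alice", "Maria", "Yuki"]   # hypothetical stand-in for dicts/female.dict
MALE = ["Bob", "Carlos", "Omar"]      # hypothetical stand-in for dicts/male.dict

def dict_list(entries):
    """Base generator: a random entry from a dictionary."""
    return lambda rng: rng.choice(entries)

def probability(p_first, first, second):
    """Flow control: run `first` with probability p_first, else `second`."""
    return lambda rng: first(rng) if rng.random() < p_first else second(rng)

def sequential(source, *steps):
    """Flow control: generate a value, then apply post-processing steps in order."""
    def generate(rng):
        value = source(rng)
        for step in steps:
            value = step(value)
        return value
    return generate

name = sequential(
    probability(0.5, dict_list(FEMALE), dict_list(MALE)),
    str.upper,                    # UpperLowerCase step, set to upper case
    lambda s: s.ljust(100),       # Padding step, assumed to pad on the right
)

rng = random.Random(1)
sample = name(rng)
```

No name-specific generator class is needed; the field is described entirely by combining generic parts, which is what moving the logic into configuration amounts to.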

13 Core Performance
Test environment: single core laptop, no I/O
Base time for framework ~ 55 ns (Base Time)
- Seeding, method invocation, setting a value
Computation time for generator 50+ ns (Gen Time)
Cache update if referenced ~ 50 ns (Cache Update)
Cache lookup if intra row reference ~ 50 ns (Cache Lookup)
Sub-generator invocation ~ 50 ns
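These figures lend themselves to a back-of-envelope estimate of per-value cost. The Python sketch below simply adds the components; treating them as additive, and using the 50 ns minimum for generator time, are assumptions, so the results are rough lower bounds rather than measurements.

```python
BASE_NS = 55            # seeding, method invocation, setting a value
GEN_NS = 50             # generator computation, "50+ ns" on the slide, so a minimum
CACHE_UPDATE_NS = 50    # paid if the field is referenced
CACHE_LOOKUP_NS = 50    # paid for an intra-row reference
SUB_GENERATOR_NS = 50   # paid per sub-generator invocation

def estimate_ns(sub_generators=0, referenced=False, intra_row_reference=False,
                gen_ns=GEN_NS):
    """Rough lower bound on time per generated value (additive model, assumed)."""
    total = BASE_NS + gen_ns + sub_generators * SUB_GENERATOR_NS
    if referenced:
        total += CACHE_UPDATE_NS
    if intra_row_reference:
        total += CACHE_LOOKUP_NS
    return total

def values_per_second(ns_per_value):
    return 1e9 / ns_per_value

simple = estimate_ns()                       # 55 + 50 = 105 ns
nested = estimate_ns(sub_generators=2)       # 55 + 50 + 2 * 50 = 205 ns
referenced = estimate_ns(referenced=True)    # 55 + 50 + 50 = 155 ns
```

Under this model a plain value costs at least about 105 ns, which corresponds to at most roughly 9.5 million values per second on one core before any formatting or I/O.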

14 Performance Basic Generators
Basic generators without formatting
- 120 ns to 510 ns

15 Performance Formatted Values
Basic generators with formatting
- Usually > 1000 ns

16 Performance Meta Generators
Meta generator overhead:
- Base overhead ~ 50 ns
- Generator overhead starts from 50 ns
- Sub-generator invocation ~ 50 ns
Often negligible due to lazy formatting

17 Use Cases
TPC-H / SSB
- 8 tables, 61 columns (first non-trivial example)
- Without meta-FVGs: 26 custom FVGs
- 2h editing: 10 custom FVGs
- 1 day reimplementation: 0 custom FVGs, i.e. no coding
- SSB variations
- Skews on dimension attributes, fact measures, references
TPC-DI (in process)
- 20 tables, 200 columns
- 19 custom FVGs (mainly for performance in corner cases)
- 56x NullGenerator
- 32x ProbabilityGenerator
- 3000 lines of config (XML import for multiple files)

18 Conclusion & Future Work
Meta generators
- Improve usability and expressiveness
- Speed up schema definition
- Remove necessity for coding
- Enlarged configuration files
Used in TPC benchmark(s)
Performance overhead is small, often negligible
Future work
- GUI and SQL export
- SQL import and data extraction

19 Thanks
Questions?
Contact: tilmann.rabl@utoronto.ca
Download and try PDGF: http://www.paralleldatageneration.org
Some big data info in our BigBench presentation
- Tuesday, 4pm, Industry 3

