Published by Marianna Grant. Modified over 9 years ago.
1
Rapid Development of Data Generators Using Meta Generators in PDGF
Tilmann Rabl, Meikel Poess, Manuel Danisch, Hans-Arno Jacobsen
DBTest 2013, June 24, New York City
Middleware Systems Research Group (MSRG.ORG)
2
DBMS Benchmarking is Increasingly Complex
  Data volumes are skyrocketing
    Enterprise data warehouses double every three years
    Many enterprise data warehouses are in petabyte size
  Systems are becoming increasingly complex
    Large number of processor cores
      Single systems (SMP) with high number of cores (80 on commodity hardware, 2048 on specialized hardware)
      Multi-node systems (sky is the limit)
    Large memory
      Dell released a TPC-H benchmark with 15 TB of main memory on 64 systems
  How to challenge these systems?
3
Benchmarks are increasingly complex
  More tables, columns
  More relationships, dependencies, data types, …
  How to build these benchmarks?
  Parallel Data Generation Framework to the rescue!
4
Parallel Data Generation Framework
  Generic data generation framework
  Relational model
    Schema specified in configuration file
    Post-processing stage for alternative representations
  Repeatable computation
    Based on XORSHIFT random number generators
    Hierarchical seeding strategy
5
Repeatable Data Generation
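The idea of repeatable computation from xorshift generators and hierarchical seeding can be sketched as follows. This is only an illustration, not PDGF's implementation: the `derive_seed` mixing step, the three-level hierarchy (table, column, row) and all function names are assumptions made for the example. Only the xorshift step itself is the well-known 64-bit variant by Marsaglia with shifts 13, 7, 17.

```python
MASK64 = (1 << 64) - 1


def xorshift64(state: int) -> int:
    """One step of Marsaglia's 64-bit xorshift (shifts 13, 7, 17)."""
    state &= MASK64
    state ^= (state << 13) & MASK64
    state ^= state >> 7
    state ^= (state << 17) & MASK64
    return state


def derive_seed(parent_seed: int, index: int) -> int:
    """Hypothetical child-seed derivation: mix the index into the parent seed."""
    mixed = (parent_seed + (index + 1) * 0x9E3779B97F4A7C15) & MASK64
    # xorshift maps 0 to 0, so never start from the zero state.
    return xorshift64(mixed or 1)


def field_value(project_seed: int, table: int, column: int, row: int,
                lo: int, hi: int) -> int:
    """Value of one field, computed only from its position in the seed hierarchy."""
    table_seed = derive_seed(project_seed, table)
    column_seed = derive_seed(table_seed, column)
    row_seed = derive_seed(column_seed, row)
    return lo + xorshift64(row_seed) % (hi - lo + 1)


# The same coordinates always give the same value, in any generation order.
first = field_value(42, table=0, column=1, row=1000, lo=0, hi=120)
again = field_value(42, table=0, column=1, row=1000, lo=0, hi=120)
print(first == again)
```

Because every value depends only on its coordinates, any row can be recomputed on any node or thread without generating the rows before it.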
6
PDGF Architecture
  Controller: Initialization
  Meta Scheduler: Inter-node scheduling
  Scheduler: Inter-thread scheduling
  Worker: Blockwise data generation
  Update Black Box: Co-ordination of data updates
  Seeding System: Random sequence adaptation
  Generators: Value generation
  Output system: Data formatting
  To generate data for a schema the user defines:
    Schema XML file: Defines relational schema
    Generation XML file: Defines output format (CSV, XML, merging tables)
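The scheduler and worker roles listed above suggest splitting a table into row blocks that workers generate independently. The sketch below illustrates that idea only. The block layout, the `row_value` stand-in generator and the use of a thread pool are all assumptions for the example and say nothing about how PDGF schedules work internally.

```python
from concurrent.futures import ThreadPoolExecutor


def row_value(seed, row):
    """Stand-in for a seeded generator: depends only on (seed, row)."""
    return ((seed ^ row) * 2654435761) % (1 << 32)


def make_blocks(total_rows, block_size):
    """Split the row range [0, total_rows) into consecutive blocks."""
    return [range(start, min(start + block_size, total_rows))
            for start in range(0, total_rows, block_size)]


def generate_block(seed, block):
    """Work done by one worker for one block."""
    return [row_value(seed, row) for row in block]


def generate_parallel(seed, total_rows, block_size, workers):
    """Hand blocks to a pool of workers and concatenate results in row order."""
    blocks = make_blocks(total_rows, block_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in block order, whichever worker finishes first.
        parts = pool.map(lambda block: generate_block(seed, block), blocks)
        return [value for part in parts for value in part]


sequential = generate_block(7, range(1000))
parallel = generate_parallel(7, total_rows=1000, block_size=64, workers=4)
print(sequential == parallel)
```

Since each value is a pure function of the seed and the row number, block size and worker count change only who computes a row, not what is computed.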
7
Configuring PDGF
  Schema configuration
    Data model
      Relational model
      Tables, fields
    Properties
      Table size, characters, …
    Generators
      Base generators
      Meta generators
    Update definition
      Insert, update, delete
      Generated as change data capture
  Schema XML excerpt: ${S} <field name="S_SUPPKEY" size="" type="NUMERIC" primary="true" unique="true"> 0 true 9 Supplier [..]
8
Base Generators in PDGF
  DictList generator
    Random line from file
  Long generator
    Random long in interval
  Others
    StaticValue
    Double
    Date
    String
    Text
    …
  Configuration excerpt (values only): 10000 java.sql.types.VARCHAR 100 dicts/names.dict java.sql.types.NUMERIC 0 120
9
Null Generator
  Add NULL logic to every generator?
    Could easily be implemented in higher class
    Adds to the configuration file
    Reduces performance (every time)
  Higher order generator NullGenerator
    Only used if added to the schema
    Can be added to any generator
  Configuration excerpt (values only): java.sql.types.NUMERIC 0.05 0 120
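The higher-order idea behind the NullGenerator can be sketched as a wrapper that decides whether to return NULL before delegating to the wrapped generator. The class and method names below are hypothetical and do not reflect PDGF's Java API. The values 0.05, 0 and 120 are taken from the slide, read here as a NULL probability and a long interval, which is an assumption.

```python
import random


class LongGenerator:
    """Base generator: random integer in the closed interval [lo, hi]."""

    def __init__(self, lo, hi):
        self.lo = lo
        self.hi = hi

    def next_value(self, rng):
        return rng.randint(self.lo, self.hi)


class NullGenerator:
    """Meta generator: returns None with the given probability, else delegates."""

    def __init__(self, probability, inner):
        self.probability = probability
        self.inner = inner

    def next_value(self, rng):
        if rng.random() < self.probability:
            return None  # the wrapped generator is not invoked at all
        return self.inner.next_value(rng)


rng = random.Random(1)
age = NullGenerator(0.05, LongGenerator(0, 120))
print([age.next_value(rng) for _ in range(10)])
```

Generators that never need NULLs pay nothing for the feature, because the check lives only in the wrapper.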
10
Meta Generators
  Control flow and post-processing generators
  Null generator controls flow
  Post-processing
    FormattedNumberGenerator
    PaddingGenerator
    UpperLowerCaseGenerator
    PrePostfixGenerator
    FormulaGenerator
  Flow control
    ProbabilityGenerator
    SequentialGenerator
    IfGenerator
    SwitchGenerator
    ReferenceGenerator
11
Post-Processing Example
  Phone number for users
    Tens of representations
    PhoneNumberGenerator was too inflexible
  Formatted long number
    Long numbers between 10010001 and 9999999999
    Number formatting: (%d%d%d) %d%d%d-%d%d%d%d
  Configuration excerpt (values only): java.sql.types.VARCHAR 30 10010001 9999999999 (%d%d%d) %d%d%d-%d%d%d%d
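A minimal sketch of the post-processing step: a long value is turned into a phone number by filling the `%d` slots of the pattern from the slide. It assumes that each `%d` consumes one decimal digit and that shorter numbers are zero-padded on the left. The slide does not state how FormattedNumberGenerator actually handles this, and `format_digits` is a hypothetical helper.

```python
PHONE_PATTERN = "(%d%d%d) %d%d%d-%d%d%d%d"  # pattern from the slide


def format_digits(number, pattern):
    """Fill every %d slot of the pattern with one decimal digit of number."""
    slots = pattern.count("%d")
    digits = str(number).zfill(slots)  # assumption: left-pad with zeros
    if len(digits) != slots:
        raise ValueError("number has more digits than the pattern has slots")
    return pattern % tuple(int(digit) for digit in digits)


print(format_digits(4165550123, PHONE_PATTERN))  # (416) 555-0123
print(format_digits(10010001, PHONE_PATTERN))    # (001) 001-0001
```

Changing the representation then only means changing the pattern string, which is the flexibility a dedicated PhoneNumberGenerator lacked.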
12
Flow Control Example
  More elaborate name field
    Name male or female, 50% chance
    All upper case
    Padded to 100 characters
  Generators used
    Sequential generator
    Probability generator
    DictList generator
    UpperLowerCase generator
    Padding generator
  Configuration excerpt (values only): java.sql.types.VARCHAR 100 dicts/female.dict dicts/male.dict uppercase true
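The nesting in this example can be sketched by composing small generator functions. Everything here is illustrative: the function names are hypothetical, the two name lists are made-up stand-ins for dicts/female.dict and dicts/male.dict, and the SequentialGenerator from the slide is left out because its exact role is not described.

```python
import random

FEMALE = ["Alice", "Maria", "Sofia"]  # stand-in for dicts/female.dict
MALE = ["Bob", "Carlos", "Ivan"]      # stand-in for dicts/male.dict


def dict_list(entries):
    """Base generator: random entry from a dictionary."""
    return lambda rng: rng.choice(entries)


def probability(choices):
    """Flow control: pick one sub-generator according to its probability."""
    def generate(rng):
        draw = rng.random()
        cumulative = 0.0
        for weight, generator in choices:
            cumulative += weight
            if draw < cumulative:
                return generator(rng)
        return choices[-1][1](rng)  # guard against rounding in the weights
    return generate


def upper_case(inner):
    """Post-processing: convert the inner value to upper case."""
    return lambda rng: inner(rng).upper()


def padding(inner, size, fill=" "):
    """Post-processing: right-pad the inner value to a fixed width."""
    return lambda rng: inner(rng).ljust(size, fill)


name = padding(
    upper_case(probability([(0.5, dict_list(FEMALE)), (0.5, dict_list(MALE))])),
    100,
)
value = name(random.Random(0))
print(repr(value.rstrip()), len(value))
```

Each wrapper knows nothing about what it wraps, so the same pieces can be recombined for other fields without new code.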
13
Core Performance
  Test environment: single core laptop, no I/O
  Base time for framework ~ 55 ns (Base Time)
    Seeding, method invocation, setting a value
  Computation time for generator 50+ ns (Gen Time)
  Cache update if referenced ~ 50 ns (Cache Update)
  Cache lookup if intra row reference ~ 50 ns (Cache Lookup)
  Sub-generator invocation ~ 50 ns
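The per-value costs above can be combined into a rough back-of-the-envelope estimate. The sketch assumes the costs simply add up, which the slide does not state, and it uses the 50 ns lower bound for generator time, so results are optimistic. `estimate_ns` and `values_per_second` are hypothetical helpers.

```python
BASE_NS = 55            # framework base time per value
GEN_NS = 50             # generator computation, lower bound ("50+ ns")
CACHE_UPDATE_NS = 50    # only if the field is referenced
CACHE_LOOKUP_NS = 50    # only for intra-row references
SUB_INVOCATION_NS = 50  # per sub-generator call


def estimate_ns(generators=1, sub_invocations=0, cache_updates=0, cache_lookups=0):
    """Additive lower-bound estimate of the time to produce one value."""
    return (BASE_NS
            + generators * GEN_NS
            + sub_invocations * SUB_INVOCATION_NS
            + cache_updates * CACHE_UPDATE_NS
            + cache_lookups * CACHE_LOOKUP_NS)


def values_per_second(ns_per_value):
    """Single-core throughput implied by a per-value time in nanoseconds."""
    return 1e9 / ns_per_value


plain = estimate_ns()                                 # one base generator
wrapped = estimate_ns(generators=2, sub_invocations=1)  # e.g. meta + base generator
print(plain, wrapped, round(values_per_second(wrapped)))
```

Under these assumptions a single base generator costs at least 105 ns per value and a meta generator wrapping it at least 205 ns.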
14
Performance Basic Generators Basic generators without formatting 120ns – 510ns
15
Performance Formatted Values Basic Generators with formatting Usually > 1000ns
16
Performance Meta Generators
  Meta generator overhead:
    Base overhead ~ 50 ns
    Generator overhead starts from 50 ns
    Sub-generator invocation ~ 50 ns
  Often negligible due to lazy formatting
17
Use Cases
  TPC-H / SSB
    8 tables, 61 columns (first non-trivial example)
    Without meta-FVGs: 26 custom FVGs
    2 h of editing: 10 custom FVGs
    1 day of reimplementation: 0 custom FVGs, i.e. no coding
    SSB variations: skews on dimension attributes, fact measures, references
  TPC-DI (in progress)
    20 tables, 200 columns
    19 custom FVGs (mainly for performance in corner cases)
    56x NullGenerator
    32x ProbabilityGenerator
    3000 lines of config (XML import for multiple files)
18
Conclusion & Future Work
  Meta generators
    Improve usability and expressiveness
    Speed up schema definition
    Remove necessity for coding
    Enlarged configuration files
    Used in TPC benchmark(s)
    Performance overhead is small, often negligible
  Future work
    GUI and SQL export
    SQL import and data extraction
19
Thanks
  Questions?
  Contact: tilmann.rabl@utoronto.ca
  Download and try PDGF: http://www.paralleldatageneration.org
  Some big data info in our BigBench presentation: Tuesday, 4pm, Industry 3