Rapid Development of Data Generators Using Meta Generators in PDGF Tilmann Rabl, Meikel Poess, Manuel Danisch, Hans-Arno Jacobsen DBTest 2013, June 24,


Rapid Development of Data Generators Using Meta Generators in PDGF
Tilmann Rabl, Meikel Poess, Manuel Danisch, Hans-Arno Jacobsen
DBTest 2013, June 24, New York City
Middleware Systems Research Group (MSRG.ORG)

DBMS Benchmarking is Increasingly Complex
- Data volumes are skyrocketing
  - Enterprise data warehouses double every three years
  - Many enterprise data warehouses are petabytes in size
- Systems are becoming increasingly complex
  - Large number of processor cores
  - Single systems (SMP) with a high number of cores (80 on commodity hardware, 2048 on specialized hardware)
  - Multi-node systems (sky is the limit)
  - Large memory
  - Dell released a TPC-H benchmark with 15 TB of main memory on 64 systems
How to challenge these systems?

Benchmarks are Increasingly Complex
- More tables, columns
- More relationships, dependencies, data types, …
How to build these benchmarks?
Parallel Data Generation Framework to the rescue!

Parallel Data Generation Framework
- Generic data generation framework
- Relational model
  - Schema specified in a configuration file
  - Post-processing stage for alternative representations
- Repeatable computation
  - Based on XORSHIFT random number generators
  - Hierarchical seeding strategy
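
The repeatable-computation idea can be illustrated with a small sketch. It assumes Marsaglia's 64-bit xorshift (shifts 13, 7, 17) and a seed hierarchy of project, table, column, row; the slides do not state PDGF's exact generator variant or seed derivation, so `child_seed`, `field_value` and the mixing constant are hypothetical.

```python
MASK64 = (1 << 64) - 1
GOLDEN = 0x9E3779B97F4A7C15  # odd constant used to spread child indices (assumption)


def xorshift64(state: int) -> int:
    """One step of Marsaglia's 64-bit xorshift generator (shifts 13, 7, 17)."""
    state ^= (state << 13) & MASK64
    state ^= state >> 7
    state ^= (state << 17) & MASK64
    return state


def child_seed(parent_seed: int, index: int) -> int:
    """Derive the seed of child number `index` from its parent's seed."""
    mixed = (parent_seed + (index + 1) * GOLDEN) & MASK64
    if mixed == 0:  # xorshift must never start from state 0
        mixed = GOLDEN
    return xorshift64(mixed)


def field_value(project_seed: int, table: int, column: int, row: int) -> int:
    """Walk project -> table -> column -> row and return a 64-bit value."""
    table_seed = child_seed(project_seed, table)
    column_seed = child_seed(table_seed, column)
    row_seed = child_seed(column_seed, row)
    return xorshift64(row_seed)


# The same coordinates always give the same value, in any order, on any worker.
first = field_value(12345, table=0, column=2, row=1_000_000)
second = field_value(12345, table=0, column=2, row=1_000_000)
print(first == second)
```

Because a value depends only on the project seed and its coordinates, any row can be recomputed without generating the rows before it.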

Repeatable Data Generation

PDGF Architecture
- Controller: initialization
- Meta Scheduler: inter-node scheduling
- Scheduler: inter-thread scheduling
- Worker: blockwise data generation
- Update Black Box: coordination of data updates
- Seeding System: random sequence adaptation
- Generators: value generation
- Output System: data formatting
To generate data for a schema, the user defines:
- Schema XML file: defines the relational schema
- Generation XML file: defines the output format (CSV, XML, merging tables)
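
A rough sketch of how the two scheduling levels could divide a table: rows are first split across nodes, then each node's share is split across its worker threads. The function names and the contiguous-range policy are assumptions for illustration, not PDGF's actual scheduler.

```python
def split_range(start: int, stop: int, parts: int) -> list[tuple[int, int]]:
    """Split [start, stop) into `parts` contiguous, near-equal half-open ranges."""
    base, extra = divmod(stop - start, parts)
    ranges = []
    low = start
    for i in range(parts):
        high = low + base + (1 if i < extra else 0)
        ranges.append((low, high))
        low = high
    return ranges


def plan(rows: int, nodes: int, threads_per_node: int) -> list[list[tuple[int, int]]]:
    """Outer list: one entry per node. Inner list: one row block per worker thread."""
    return [
        split_range(low, high, threads_per_node)
        for low, high in split_range(0, rows, nodes)
    ]


for node_id, blocks in enumerate(plan(rows=10, nodes=2, threads_per_node=2)):
    print(node_id, blocks)
```

Such a static split needs no communication between workers as long as every value can be computed independently from its seed.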

Configuring PDGF
Schema configuration
- Data model: relational model (tables, fields)
- Properties: table size, characters, …
- Generators: base generators, meta generators
- Update definition: insert, update, delete; generated as change data capture
Schema excerpt (most XML markup lost in the transcript):
${S} <field name="S_SUPPKEY" size="" type="NUMERIC" primary="true" unique="true"> 0 true 9 Supplier [..]

Base Generators in PDGF
- DictList generator: random line from a file
- Long generator: random long in an interval
- Others: StaticValue, Double, Date, String, Text, …
Configuration values from the slide (XML markup lost in the transcript):
- java.sql.types.VARCHAR, 100, dicts/names.dict
- java.sql.types.NUMERIC, 0, 120
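
The two base generators named on this slide might behave roughly as follows. The class and method names are hypothetical, an in-memory list stands in for a dictionary file such as dicts/names.dict, and treating the interval as inclusive is an assumption.

```python
import random


class DictListGenerator:
    """Returns a random entry from a dictionary (one entry per line in a file)."""

    def __init__(self, entries):
        self.entries = list(entries)

    def next_value(self, rng: random.Random) -> str:
        return rng.choice(self.entries)


class LongGenerator:
    """Returns a random integer in the inclusive interval [minimum, maximum]."""

    def __init__(self, minimum: int, maximum: int):
        self.minimum = minimum
        self.maximum = maximum

    def next_value(self, rng: random.Random) -> int:
        return rng.randint(self.minimum, self.maximum)


names = DictListGenerator(["Alice", "Bob", "Carol"])  # stand-in dictionary
ages = LongGenerator(0, 120)                          # bounds as on the slide

rng = random.Random(42)  # a fixed seed makes the sequence repeatable
print(names.next_value(rng), ages.next_value(rng))
```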

Null Generator
Add NULL logic to every generator?
- Could easily be implemented in a higher class
- Adds to the configuration file
- Reduces performance (every time)
Higher-order generator: NullGenerator
- Only used if added to the schema
- Can be added to any generator
Configuration value from the slide (XML markup lost in the transcript): java.sql.types.NUMERIC
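
The higher-order design can be sketched as a wrapper: the NULL decision lives in one class and is only paid for where that class is configured. The interface below (`next_value`, `null_probability`) is an assumption for illustration, not PDGF's API.

```python
import random


class StaticValueGenerator:
    """Trivial sub-generator that always returns the same value."""

    def __init__(self, value):
        self.value = value

    def next_value(self, rng: random.Random):
        return self.value


class NullGenerator:
    """Wraps any generator and returns None with a given probability."""

    def __init__(self, sub_generator, null_probability: float):
        self.sub_generator = sub_generator
        self.null_probability = null_probability

    def next_value(self, rng: random.Random):
        # Decide first; the sub-generator is only invoked for non-NULL values.
        if rng.random() < self.null_probability:
            return None
        return self.sub_generator.next_value(rng)


rng = random.Random(3)
nullable = NullGenerator(StaticValueGenerator(7), null_probability=0.25)
values = [nullable.next_value(rng) for _ in range(8)]
print(values)
```

Generators that never produce NULL are left unwrapped and carry no extra check.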

Meta Generators
Control flow and post-processing generators
- Null generator controls flow
- Post-processing
  - FormattedNumberGenerator
  - PaddingGenerator
  - UpperLowerCaseGenerator
  - PrePostfixGenerator
  - FormulaGenerator
- Flow control
  - ProbabilityGenerator
  - SequentialGenerator
  - IfGenerator
  - SwitchGenerator
  - ReferenceGenerator

Post-Processing Example
Phone number for users
- Tens of representations
- PhoneNumberGenerator was too inflexible
Formatted long number
- Long numbers in a fixed interval (bounds missing in the transcript)
- Number formatting: (%d%d%d) %d%d%d-%d%d%d%d
Configuration values from the slide (XML markup lost in the transcript): java.sql.types.VARCHAR, (%d%d%d) %d%d%d-%d%d%d%d
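
One way to read the pattern on this slide is that each %d receives one decimal digit of the generated long. That interpretation, and the helper name `format_digits`, are assumptions; the slide does not say how FormattedNumberGenerator consumes the pattern.

```python
def format_digits(number: int, pattern: str = "(%d%d%d) %d%d%d-%d%d%d%d") -> str:
    """Fill each %d placeholder in `pattern` with one decimal digit of `number`."""
    if number < 0:
        raise ValueError("number must be non-negative")
    slots = pattern.count("%d")
    digits = str(number).zfill(slots)  # left-pad short numbers with zeros
    if len(digits) != slots:
        raise ValueError("number has more digits than the pattern has placeholders")
    return pattern % tuple(int(ch) for ch in digits)


print(format_digits(4165550123))
print(format_digits(4165550123, "%d%d%d.%d%d%d.%d%d%d%d"))
```

Changing the representation then only means changing the pattern string, not writing a new generator.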

Flow Control Example
More elaborate name field
- Name is male or female, 50% chance each
- All upper case
- Padded to 100 characters
Generator composition
- Sequential generator
  - Probability generator
    - DictList generator
  - UpperLowerCase generator
  - Padding generator
Configuration values from the slide (XML markup lost in the transcript): java.sql.types.VARCHAR, 100, dicts/female.dict, dicts/male.dict, uppercase, true
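
The composition on this slide can be sketched with plain functions: a sequential stage chains a probability choice between two dictionaries, an upper-case step and a padding step. All function names are hypothetical, and short in-memory lists stand in for dicts/female.dict and dicts/male.dict.

```python
import random

FEMALE = ["Alice", "Maria"]  # stand-in for dicts/female.dict
MALE = ["Bob", "John"]       # stand-in for dicts/male.dict


def dict_list(entries):
    return lambda rng, value=None: rng.choice(entries)


def probability(*weighted):
    """Pick one sub-generator; `weighted` holds (probability, generator) pairs."""
    def generate(rng, value=None):
        draw, cumulative = rng.random(), 0.0
        for weight, generator in weighted:
            cumulative += weight
            if draw < cumulative:
                return generator(rng, value)
        return weighted[-1][1](rng, value)  # guard against rounding error
    return generate


def upper_case():
    return lambda rng, value: value.upper()


def padding(width, fill=" "):
    return lambda rng, value: value.ljust(width, fill)


def sequential(*stages):
    """Run the stages in order, feeding each stage the previous result."""
    def generate(rng, value=None):
        for stage in stages:
            value = stage(rng, value)
        return value
    return generate


name_generator = sequential(
    probability((0.5, dict_list(FEMALE)), (0.5, dict_list(MALE))),
    upper_case(),
    padding(100),
)

print(repr(name_generator(random.Random(7))))
```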

Core Performance
Test environment: single-core laptop, no I/O
- Base time for the framework: ~55 ns (Base Time); covers seeding, method invocation, setting a value
- Computation time for a generator: 50+ ns (Gen Time)
- Cache update if the value is referenced: ~50 ns (Cache Update)
- Cache lookup for an intra-row reference: ~50 ns (Cache Lookup)
- Sub-generator invocation: ~50 ns

Performance: Basic Generators
- Basic generators without formatting: 120 ns to 510 ns

Performance: Formatted Values
- Basic generators with formatting: usually > 1000 ns

Performance: Meta Generators
Meta generator overhead:
- Base overhead: ~50 ns
- Generator overhead: starts from 50 ns
- Sub-generator invocation: ~50 ns
Often negligible due to lazy formatting
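
As a back-of-the-envelope reading of these figures, one can assume the costs simply add up: each meta generator level contributes its own generator overhead plus one sub-generator invocation on top of the wrapped generator's time. The additive model and the function below are assumptions, not measurements from the talk.

```python
def estimate_ns(basic_generator_ns: float, meta_levels: int,
                meta_overhead_ns: float = 50.0,
                invocation_ns: float = 50.0) -> float:
    """Estimated time per value for a basic generator wrapped in `meta_levels` meta generators."""
    return basic_generator_ns + meta_levels * (meta_overhead_ns + invocation_ns)


def values_per_second(ns_per_value: float) -> float:
    return 1e9 / ns_per_value


# A 120 ns basic generator wrapped once (for example in a NullGenerator):
wrapped = estimate_ns(120, meta_levels=1)
print(wrapped, round(values_per_second(wrapped)))
```

Under this model a single wrapper adds about 100 ns, which is small next to the more than 1000 ns typically spent on formatted values.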

Use Cases
TPC-H / SSB
- 8 tables, 61 columns (first non-trivial example)
- Without meta field value generators (FVGs): 26 custom FVGs
- After 2 h of editing: 10 custom FVGs
- After a 1-day reimplementation: 0 custom FVGs, i.e. no coding
- SSB variations: skews on dimension attributes, fact measures, references
TPC-DI (in progress)
- 20 tables, 200 columns
- 19 custom FVGs (mainly for performance in corner cases)
- 56x NullGenerator
- 32x ProbabilityGenerator
- 3000 lines of config (XML import for multiple files)

Conclusion & Future Work
Meta generators
- Improve usability and expressiveness
- Speed up schema definition
- Remove the necessity for coding
- Enlarge configuration files
Used in TPC benchmark(s)
Performance overhead is small, often negligible
Future work
- GUI and SQL export
- SQL import and data extraction

Thanks
Questions?
- Contact:
- Download and try PDGF:
- Some big data info in our BigBench presentation: Tuesday, 4pm, Industry 3