EvoGen: a Generator for Synthetic Versioned RDF
Marios Meimaris
Institute for the Management of Information Systems, Research Center “Athena”

Data Web
Evolving
– Dynamic communities
– Fast-paced environments
– Open-world data

Problem Tackled
Synthetic data widely used for benchmarking
– Storage
– Querying
– Processing
Lack of tools and benchmarks for evolving RDF
– Versioning Systems
– Evolution Management Systems
– Change Detection
– insert yours here…

Requirements
Meaningful data generation
– Synthetic data generation abstraction
– Identification of characteristics
Configurability
– Definition of parameters based on characteristics
Benchmark workload
Community engagement

Parameters
We define three non-exhaustive, non-mutually exclusive parameters to drive the generation process
– Shift
– Monotonicity
– Strictness

Lehigh University Benchmark
Widely used synthetic data generator
Creates universities that contain departments with students, professors, courses, etc.
Configurable number of universities and starting index
Configurable serialization and representation model (RDF/XML in .owl files, DAML)
Widely adopted by the data engineering and semantic web community

Our system
A generator for synthetic evolving RDF data
– Based on the existing LUBM generator
– Extends LUBM to create evolving versions of the original data
– Tailors the creation process based on user-defined parameters
– This version: monotonic shifts
– Next version: configurable strictness %

Our system
Configurable parameters
– # of universities
– # of consecutive versions
– shift (double precision, w.r.t. first-version dataset)
  – Shift is distributed evenly among versions
  – All dataset classes are generated based on weight factors
– serialization mode (full vs diffs)
Next version
– Strictness as % of Characteristic Sets generated from LUBM, spread over versions
– Custom query workload
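
The parameters listed above could be grouped into a single configuration object. The following is a minimal sketch under stated assumptions: the names `GeneratorConfig` and `Serialization` are hypothetical and do not come from the EvoGen code base, the shift is assumed to be stored in percent, and the meaning attached to "full" and "diffs" in the comments is a presumed reading of the slide.

```python
from dataclasses import dataclass
from enum import Enum


class Serialization(Enum):
    # Presumed meaning: every version is written out as a complete dataset.
    FULL = "full"
    # Presumed meaning: only changes between consecutive versions are written.
    DIFFS = "diffs"


@dataclass(frozen=True)
class GeneratorConfig:
    """Hypothetical container for the configurable parameters on the slide."""
    universities: int          # number of universities
    versions: int              # number of consecutive versions
    shift: float               # assumed percent, w.r.t. the first-version dataset
    serialization: Serialization = Serialization.FULL

    def __post_init__(self) -> None:
        # Basic sanity checks; the real generator may validate differently.
        if self.universities < 1:
            raise ValueError("universities must be at least 1")
        if self.versions < 1:
            raise ValueError("versions must be at least 1")


config = GeneratorConfig(universities=5, versions=10, shift=0.3,
                         serialization=Serialization.DIFFS)
```

Making the object immutable (`frozen=True`) is a design choice of this sketch: a generation run is then fully described by one value that can be logged next to the produced versions.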

Resulting Data
Based on Lehigh University Benchmark (LUBM)
User defines:
– shift as a positive or negative percentage
– number of versions to be created
LUBM schema classes are given weights based on their contribution to the dataset’s size
The defined shift is distributed across all LUBM classes in proportion to their weights
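
The weighting step can be illustrated with a short sketch. It assumes that a class's contribution to the dataset's size is measured by its instance count (it could equally be a triple count), and the counts in the usage example are invented for illustration, not taken from a real LUBM run. The function names are hypothetical.

```python
def class_weights(instance_counts: dict[str, int]) -> dict[str, float]:
    """Weight of each class = its share of all instances (an assumption)."""
    total = sum(instance_counts.values())
    if total == 0:
        raise ValueError("dataset is empty")
    return {cls: n / total for cls, n in instance_counts.items()}


def distribute_shift(instance_counts: dict[str, int],
                     shift_percent: float) -> dict[str, int]:
    """Instances to add (positive) or remove (negative) per class."""
    total = sum(instance_counts.values())
    total_change = total * shift_percent / 100.0
    weights = class_weights(instance_counts)
    return {cls: round(total_change * w) for cls, w in weights.items()}


# Invented counts for four LUBM classes, for illustration only.
counts = {
    "UndergraduateStudent": 6000,
    "GraduateStudent": 2000,
    "Course": 1500,
    "FullProfessor": 500,
}
changes = distribute_shift(counts, shift_percent=10.0)
```

With these numbers the total change is 1000 instances, split 600 / 200 / 150 / 50 across the four classes. A negative `shift_percent` yields negative values, meaning instances to remove.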

System Architecture
[Figure: system architecture diagram]

Evaluation of Shift Parameter
Measure the achieved shift w.r.t. the desired shift for an increasing number of universities
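
One way to express this measurement is sketched below. It assumes that dataset size is counted in triples and that shift is the relative size change of the last version against the first, in percent. The function names and the sizes in the usage lines are hypothetical, not results from the evaluation.

```python
def achieved_shift(first_size: int, last_size: int) -> float:
    """Relative size change of the last version w.r.t. the first, in percent."""
    if first_size <= 0:
        raise ValueError("first version must be non-empty")
    return (last_size - first_size) / first_size * 100.0


def shift_error(desired_percent: float, achieved_percent: float) -> float:
    """Signed deviation of the achieved shift from the desired one."""
    return achieved_percent - desired_percent


# Hypothetical sizes (in triples), chosen only to illustrate the arithmetic.
achieved = achieved_shift(first_size=100_000, last_size=100_290)
error = shift_error(desired_percent=0.3, achieved_percent=achieved)
```

Here the achieved shift is about 0.29% against a desired 0.3%, so the generator would have undershot by roughly 0.01 percentage points.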

Further resources
Lehigh University Benchmark (LUBM)
– Source code repository
– Paper

Example of usage
User defines:
– 5 universities
– 10 versions
– 0.3% incremental change, evenly distributed between versions
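
The arithmetic behind this example can be sketched as follows. The sketch assumes that the total shift is split into equal additive steps over the transitions between consecutive versions (9 transitions for 10 versions), each measured against the first-version dataset. The slides do not state the exact scheme, so this is one plausible reading, and `cumulative_shifts` is a hypothetical helper.

```python
def cumulative_shifts(total_shift_percent: float, versions: int) -> list[float]:
    """Cumulative shift (in percent of the first version) for each version.

    Assumes equal additive steps across the versions - 1 transitions.
    """
    if versions < 2:
        # With fewer than two versions there is nothing to evolve.
        return [0.0] * versions
    step = total_shift_percent / (versions - 1)
    return [step * i for i in range(versions)]


# Values from the example: 10 versions, 0.3% total change.
schedule = cumulative_shifts(total_shift_percent=0.3, versions=10)
```

Under this reading the first version carries no shift, each later version adds about 0.033% of the first version's size, and the tenth version reaches the full 0.3%.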

Thank you
Questions?