Published by Milo Summers. Modified over 8 years ago.
1
EvoGen: a Generator for Synthetic Versioned RDF
Marios Meimaris
Institute for the Management of Information Systems, Research Center “Athena”
m.meimaris@imis.athena-innovation.gr
Meimaris@DIACHRON2016
2
Data Web
Evolving
– Dynamic communities
– Fast-paced environments
– Open-world data
3
Problem Tackled
Synthetic data widely used for benchmarking
– Storage
– Querying
– Processing
Lack of tools and benchmarks for evolving RDF
– Versioning systems
– Evolution management systems
– Change detection
– insert yours here…
4
Requirements
Meaningful data generation
– Synthetic data generation abstraction
– Identification of characteristics
Configurability
– Definition of parameters based on characteristics
Benchmark workload
Community engagement
5
Parameters
We define three non-exhaustive, non-mutually exclusive parameters to drive the generation process
– Shift
– Monotonicity
– Strictness
6
Parameters
7
Parameters
8
Parameters
9
Lehigh University Benchmark
Widely used synthetic data generator
Creates universities that contain departments with students, professors, courses, etc.
Configurable number of universities and starting index
Configurable serialization and representation model (RDF/XML in .owl files, DAML)
Widely adopted by the data engineering and semantic web community
10
Lehigh University Benchmark
Image: http://blog.andric.name/wp-content/uploads/2013/06/univ-bench.owl_.png
11
Our system
A generator for synthetic evolving RDF data
– Based on existing LUBM generator
– Extends LUBM to create evolving versions of original data
– Tailors creation process based on user-defined parameters
– This version: monotonic shifts
– Next version: configurable strictness %
12
Our system
Configurable parameters
– # of universities
– # of consecutive versions
– shift (double precision, w.r.t. first-version dataset)
  Shift is distributed evenly among versions
  All dataset classes are generated based on weight factors
– serialization mode (full vs diffs)
Next version
– Strictness as % of Characteristic Sets generated from LUBM, spread over versions
– Custom query workload
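The difference between the "full" and "diffs" serialization modes can be sketched as follows. This is only an illustration under stated assumptions: each version is modeled as a set of (subject, predicate, object) tuples, and "diffs" is taken to mean storing the added and deleted triples between consecutive versions. The names `Delta`, `diff` and `apply_diff`, and the example triples, are hypothetical and are not taken from the EvoGen code base.

```python
from typing import FrozenSet, NamedTuple, Set, Tuple

# Assumption: a triple is a plain (subject, predicate, object) tuple of strings.
Triple = Tuple[str, str, str]


class Delta(NamedTuple):
    """Change set between two versions (hypothetical structure)."""
    added: FrozenSet[Triple]
    deleted: FrozenSet[Triple]


def diff(old: Set[Triple], new: Set[Triple]) -> Delta:
    """Compute the triples added to and deleted from `old` to reach `new`."""
    return Delta(added=frozenset(new - old), deleted=frozenset(old - new))


def apply_diff(old: Set[Triple], delta: Delta) -> Set[Triple]:
    """Rebuild the full next version from the previous one and a delta."""
    return (set(old) - delta.deleted) | delta.added


# Illustrative data only; identifiers are made up.
v1 = {
    ("ex:Dept0", "rdf:type", "ub:Department"),
    ("ex:Prof0", "ub:worksFor", "ex:Dept0"),
}
v2 = v1 | {("ex:Student0", "ub:memberOf", "ex:Dept0")}

delta = diff(v1, v2)          # what "diffs" mode would store for version 2
rebuilt = apply_diff(v1, delta)  # what "full" mode would store for version 2
```

Storing deltas trades reconstruction cost for space: a full version is recovered by replaying every delta from the first version onward.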
13
Resulting Data
Based on Lehigh University Benchmark (LUBM)
User defines:
– shift as a positive or negative percentage
– number of versions to be created
LUBM schema classes are given weights based on their contribution to the dataset’s size
Shift percentage is distributed to all LUBM classes based on their weights and the defined shift
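One way to read the weighting scheme above is sketched below. It assumes that a class's weight is its share of all instances in the first version and that the total change implied by the shift is split across classes in proportion to those weights. The slides do not give the exact formula, so this is an interpretation; the function names and the instance counts are invented for illustration.

```python
from typing import Dict


def class_weights(counts: Dict[str, int]) -> Dict[str, float]:
    """Weight of each class = its share of all instances (assumed definition)."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("dataset must contain at least one instance")
    return {cls: n / total for cls, n in counts.items()}


def distribute_shift(counts: Dict[str, int], shift: float) -> Dict[str, int]:
    """Split the total change implied by `shift` across classes by weight.

    `shift` is a fraction of the first-version size: 0.1 means +10%,
    a negative value shrinks the dataset.
    """
    weights = class_weights(counts)
    total_change = shift * sum(counts.values())
    return {cls: round(total_change * w) for cls, w in weights.items()}


# Made-up instance counts, not real LUBM figures.
first_version = {
    "UndergraduateStudent": 6000,
    "GraduateStudent": 2000,
    "Course": 1500,
    "FullProfessor": 500,
}

changes = distribute_shift(first_version, shift=0.1)
```

With weights defined this way, every class grows or shrinks by the same relative amount, so the class mix of the first version is preserved across versions.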
14
System Architecture
15
Evaluation of Shift Parameter
Measure the achieved shift w.r.t. the desired shift for an increasing number of universities
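The comparison of achieved and desired shift could be computed along these lines. The sketch assumes dataset size is measured as a triple count and that shift is the relative change from the first to the last version; the slide does not state the measurement unit, and the numbers below are invented, not results from the talk.

```python
def achieved_shift(first_size: int, last_size: int) -> float:
    """Relative size change from the first to the last version."""
    if first_size <= 0:
        raise ValueError("first version must not be empty")
    return (last_size - first_size) / first_size


def shift_error(desired: float, achieved: float) -> float:
    """Signed deviation of the achieved shift from the desired one."""
    return achieved - desired


# Illustrative sizes only (triple counts are made up).
desired = 0.3
achieved = achieved_shift(first_size=100_000, last_size=129_500)
error = shift_error(desired, achieved)
```

Repeating this for datasets generated with an increasing number of universities gives one error value per dataset size.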
16
Further resources
Lehigh University Benchmark (LUBM)
– http://swat.cse.lehigh.edu/projects/lubm/
Source code repository
– https://github.com/mmeimaris/EvoGen
DIACHRON@EDBT paper
– http://ceur-ws.org/Vol-1558/paper9.pdf
17
Example of usage
User defines:
– 5 universities
– 10 versions
– 0.3% incremental change evenly distributed between versions
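The arithmetic behind this example can be sketched as follows. Two assumptions are made that the slide does not confirm: the 0.3% is read as the total change relative to the first version, and it is split linearly over the 9 transitions between the 10 versions. The function name `version_sizes` and the first-version size of 600,000 triples are placeholders, not measured LUBM output.

```python
from typing import List


def version_sizes(first_size: int, versions: int, total_shift: float) -> List[int]:
    """Target size of each version when the shift is spread evenly.

    Assumption: `total_shift` is a fraction of the first-version size and
    is split linearly over the `versions - 1` transitions.
    """
    if versions < 2:
        raise ValueError("need at least two versions to distribute a shift")
    total_change = first_size * total_shift
    return [
        first_size + round(total_change * i / (versions - 1))
        for i in range(versions)
    ]


# 10 versions, 0.3% total change; 600_000 is a placeholder first-version size.
sizes = version_sizes(first_size=600_000, versions=10, total_shift=0.003)
step = sizes[1] - sizes[0]
```

Under these assumptions each version differs from its predecessor by the same number of triples, and the last version is 0.3% larger than the first.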
18
Thank you
Questions?