Storing, Indexing and Querying Large Provenance Data Sets as RDF Graphs in Apache HBase

Presentation transcript:

Storing, Indexing and Querying Large Provenance Data Sets as RDF Graphs in Apache HBase
Artem Chebotko
Joint work with John Abraham and Pearl Brazier (University of Texas – Pan American), Anthony Piazza (Piazza Consulting), and Andrey Kashlev and Shiyong Lu (Wayne State University)
7th IEEE International Workshop on Scientific Workflows, July 2, 2013

Provenance in eScience
- Metadata that captures history of an experiment
  - Problem diagnosis
  - Result interpretation
  - Experiment reproducibility
- Scientific Workflow Community Provenance Challenges
  - 2006: understanding and sharing information about provenance representations and capabilities
  - 2006: interoperability of different provenance
  - 2009: evaluating various aspects of OPM
  - 2010: showcase OPM in the context of novel applications
- Open Provenance Model
- PROV-DM: The PROV Data Model (W3C Recommendation 30 April 2013)

SWFMS and Provenance
- Taverna, Kepler, View, VisTrails, Pegasus, Swift, Galaxy, Triana, OPMProv, Karma, RDFProv, etc.
  - Support provenance collection
  - Use proprietary or third-party systems to manage provenance
  - Differ in provenance models, provenance vocabularies, inference support, and query languages
  - May eventually converge to W3C PROV specifications

Sample OPM Provenance Graph
- Nodes: artifacts, processes, agents
- Edges: used, wasGeneratedBy, wasControlledBy, wasTriggeredBy, wasDerivedFrom

Sample Graph Serialization: OPMV and Terse RDF Triple Language

    utpb:schema rdf:type opmv:Artifact.
    utpb:instance rdf:type opmv:Artifact.
    utpb:dataset rdf:type opmv:Artifact.
    utpb:loadData rdf:type opmv:Process.
    utpb:loadData opmv:used utpb:schema, utpb:dataset.
    utpb:instance opmv:wasGeneratedBy utpb:loadData.
    utpb:instance opmv:wasDerivedFrom utpb:schema, utpb:dataset.
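As a rough illustration of what the serialization above contains, the following Python sketch expands the statements into individual triples. The `expand` helper is hypothetical and only mimics Turtle's comma shorthand (one subject and predicate, several objects); it is not a Turtle parser.

```python
# Hypothetical helper: expand the Turtle shorthand "s p o1, o2." into one
# (subject, predicate, object) triple per listed object.
def expand(subject, predicate, *objects):
    return [(subject, predicate, obj) for obj in objects]

# The sample provenance graph from the slide, written out triple by triple.
graph = (
    expand("utpb:schema", "rdf:type", "opmv:Artifact")
    + expand("utpb:instance", "rdf:type", "opmv:Artifact")
    + expand("utpb:dataset", "rdf:type", "opmv:Artifact")
    + expand("utpb:loadData", "rdf:type", "opmv:Process")
    + expand("utpb:loadData", "opmv:used", "utpb:schema", "utpb:dataset")
    + expand("utpb:instance", "opmv:wasGeneratedBy", "utpb:loadData")
    + expand("utpb:instance", "opmv:wasDerivedFrom", "utpb:schema", "utpb:dataset")
)

# Print each triple with its position in the list.
for position, triple in enumerate(graph):
    print(position, *triple)
```

The seven Turtle statements expand to nine triples, because two of them list two objects each.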

Provenance Serialization and Querying
- Both OPM and PROV-DM can be serialized in RDF
- Queried in SPARQL
- Find all artifacts and their values, if any, in a provenance graph with identifier

This Work - Motivation
- Single provenance graph as an RDF graph
  - In general, readily manageable in main memory of a single machine
- Hundreds of thousands or even millions of provenance graphs as a provenance (RDF) dataset
  - Challenging to manage
- Our Focus/Problem: Efficient and scalable storage and querying of large collections of provenance graphs serialized as RDF graphs (in an Apache HBase database)

This Work - Contributions
- Novel storage and indexing schemes for RDF data in HBase that are suitable for provenance datasets
- Novel and efficient querying algorithms to evaluate SPARQL queries in HBase that are optimized to make use of bitmap indices and numeric values instead of triples
- Empirical evaluation of our approach using provenance graphs and test queries of the University of Texas Provenance Benchmark (UTPB)

Talk Outline
- RDF Data and Queries
- Indexing Scheme
- Storage Scheme
- Query Processing
- Performance Study
- Related Work
- Summary and Future Work

RDF Data and Queries
[Slides 10-11: content not captured in the transcript]

Indexing Scheme
- Selection Indices: I_s, I_p, I_o
- Find a triple with known s, p and o:
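The expression for this lookup is not captured in the transcript. The sketch below shows one plausible reading in Python, assuming each selection index maps a term to a bitmap of the triple positions where it occurs, and that a lookup with known s, p and o intersects the three bitmaps. The function names and the use of Python integers as bitmaps are assumptions for illustration, not the authors' implementation.

```python
from collections import defaultdict

# A small graph; the position of each triple in the list is its identifier.
triples = [
    ("utpb:schema", "rdf:type", "opmv:Artifact"),                # position 0
    ("utpb:instance", "rdf:type", "opmv:Artifact"),              # position 1
    ("utpb:loadData", "rdf:type", "opmv:Process"),               # position 2
    ("utpb:loadData", "opmv:used", "utpb:schema"),               # position 3
    ("utpb:instance", "opmv:wasGeneratedBy", "utpb:loadData"),   # position 4
]

def build_selection_indices(triples):
    """Return (I_s, I_p, I_o); each maps a term to a bitmap of positions."""
    i_s, i_p, i_o = defaultdict(int), defaultdict(int), defaultdict(int)
    for pos, (s, p, o) in enumerate(triples):
        bit = 1 << pos  # bit number pos marks the triple at that position
        i_s[s] |= bit
        i_p[p] |= bit
        i_o[o] |= bit
    return i_s, i_p, i_o

def positions(bitmap):
    """List the positions whose bits are set in the bitmap."""
    return [pos for pos in range(bitmap.bit_length()) if (bitmap >> pos) & 1]

def select(indices, n, s=None, p=None, o=None):
    """AND the bitmaps of all bound terms; unbound terms match every triple."""
    result = (1 << n) - 1  # start with all n positions set
    for index, term in zip(indices, (s, p, o)):
        if term is not None:
            result &= index.get(term, 0)
    return positions(result)

indices = build_selection_indices(triples)
print(select(indices, len(triples), s="utpb:loadData", p="opmv:used", o="utpb:schema"))
print(select(indices, len(triples), p="rdf:type"))
```

Working with bit positions rather than the triples themselves is in the spirit of the stated contribution of using bitmap indices and numeric values instead of triples.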

Indexing Scheme
- Join Indices: I_ss, I_so, I_os, I_oo
- Find triples with the same object as subject in triple at position i: I_so(i)
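A minimal Python sketch of join indices follows, under one reading of the slide: I_so(i) lists the positions of triples whose object equals the subject of the triple at position i, and the other three indices follow the same pattern. The `build_join_index` function, the list-of-lists representation, and the choice to exclude position i from its own entry are all assumptions made for illustration.

```python
# Field offsets inside a (subject, predicate, object) triple.
FIELD = {"s": 0, "o": 2}

triples = [
    ("utpb:schema", "rdf:type", "opmv:Artifact"),                # position 0
    ("utpb:instance", "rdf:type", "opmv:Artifact"),              # position 1
    ("utpb:loadData", "rdf:type", "opmv:Process"),               # position 2
    ("utpb:loadData", "opmv:used", "utpb:schema"),               # position 3
    ("utpb:instance", "opmv:wasGeneratedBy", "utpb:loadData"),   # position 4
]

def build_join_index(triples, kind):
    """Build the join index named by kind ('ss', 'so', 'os' or 'oo').

    Entry i lists the positions j != i such that field kind[1] of triple j
    equals field kind[0] of triple i (assumed reading of the slide).
    """
    own, other = FIELD[kind[0]], FIELD[kind[1]]
    by_value = {}
    for j, triple in enumerate(triples):
        by_value.setdefault(triple[other], []).append(j)
    return [
        [j for j in by_value.get(triple[own], []) if j != i]
        for i, triple in enumerate(triples)
    ]

i_so = build_join_index(triples, "so")
i_os = build_join_index(triples, "os")
print(i_so[0])  # triples whose object is the subject of triple 0
print(i_os[4])  # triples whose subject is the object of triple 4
```

Precomputing these lists means a join between two triple patterns can follow stored positions instead of comparing term values at query time.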

Storage Scheme
- One table with two column families for data and indices
- Each row stores one complete provenance graph
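The layout can be pictured with plain Python dictionaries standing in for an HBase table. The slide states only that there is one table, two column families, and one row per provenance graph; the family names `data` and `index`, the column qualifiers, and the value encodings below are hypothetical.

```python
triples = [
    ("utpb:schema", "rdf:type", "opmv:Artifact"),     # position 0
    ("utpb:loadData", "rdf:type", "opmv:Process"),    # position 1
    ("utpb:loadData", "opmv:used", "utpb:schema"),    # position 2
]

def to_row(graph_id, triples):
    """Pack one provenance graph into a single row with two column families."""
    row = {"data": {}, "index": {}}
    for pos, (s, p, o) in enumerate(triples):
        # "data" family: one column per triple, keyed by its position.
        row["data"][str(pos)] = " ".join((s, p, o))
        # "index" family: one column per (role, term) listing matching positions.
        for role, term in (("s", s), ("p", p), ("o", o)):
            row["index"].setdefault(f"{role}:{term}", []).append(pos)
    return graph_id, row

# The table maps a row key (the graph identifier) to its row.
table = dict([to_row("utpb:opmGraph", triples)])
row = table["utpb:opmGraph"]
print(row["data"]["2"])
print(row["index"]["s:utpb:loadData"])
```

With this shape, a single lookup by row key returns a whole graph together with its indices, so a query scoped to one graph needs no cross-row joins.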

Query Processing
- Four efficient algorithms/functions:
  - application of selection indices
  - application of join indices
  - handling of special cases not supported by the indices
  - basic graph pattern evaluation
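The algorithms themselves are not captured in the transcript. For orientation, here is a deliberately naive Python sketch of basic graph pattern evaluation: it joins triple patterns by nested loops over the triples and does not use selection or join indices, so it shows what the result of the last function looks like rather than how the talk computes it. All names are hypothetical.

```python
triples = [
    ("utpb:schema", "rdf:type", "opmv:Artifact"),
    ("utpb:instance", "rdf:type", "opmv:Artifact"),
    ("utpb:loadData", "rdf:type", "opmv:Process"),
    ("utpb:loadData", "opmv:used", "utpb:schema"),
    ("utpb:instance", "opmv:wasGeneratedBy", "utpb:loadData"),
]

def match(pattern, triple, binding):
    """Extend binding so that pattern equals triple; return None on a conflict."""
    extended = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable: bind it, or check it agrees with an earlier binding.
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:
            # A constant that does not match the triple.
            return None
    return extended

def evaluate_bgp(patterns, triples):
    """Return all variable bindings that satisfy every triple pattern."""
    solutions = [{}]
    for pattern in patterns:
        next_solutions = []
        for binding in solutions:
            for triple in triples:
                extended = match(pattern, triple, binding)
                if extended is not None:
                    next_solutions.append(extended)
        solutions = next_solutions
    return solutions

patterns = [
    ("?artifact", "rdf:type", "opmv:Artifact"),
    ("?artifact", "opmv:wasGeneratedBy", "?process"),
]
print(evaluate_bgp(patterns, triples))
```

An index-based evaluator would replace the inner loop over all triples with lookups that return only candidate positions.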

Query Processing
[Slides 16-21: content not captured in the transcript]

Performance Study
- Implementation
  - Java, Hadoop 1.0.0, HBase 0.94
- Cluster setup
  - One HBase Master
  - Eight HBase Region Servers
  - All commodity machines
- Benchmark: UTPB (5 datasets, 11 queries)

Performance Study
- Q1: simplest, yet most expensive query due to a large result set
- Q1. Find all provenance graph identifiers.

    PREFIX rdf:
    PREFIX owl:
    SELECT * WHERE { ?graph rdf:type owl:Thing. }

Performance Study
- Q2 to Q11: different complexity, yet similar performance
- Example: Q8. Find all artifacts and their values, if any, in a particular provenance graph.

    PREFIX opmv:
    PREFIX rdf:
    PREFIX opmo:
    PREFIX utpb:
    SELECT ?artifact ?value
    FROM NAMED
    WHERE {
      GRAPH utpb:opmGraph {
        ?artifact rdf:type opmv:Artifact.
        OPTIONAL {
          ?artifact opmo:annotation ?annotation.
          ?annotation opmo:property ?property.
          ?property opmo:value ?value.
        }.
        OPTIONAL {
          ?artifact opmo:avalue ?artifactValue.
          ?artifactValue opmo:content ?value.
        }.
      }
    }
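The OPTIONAL groups in Q8 behave like a left outer join: every artifact is returned, and a value is attached only when the optional pattern matches. The Python sketch below shows that semantics over lists of variable bindings. The `left_join` helper and the sample value `"42"` are made up for illustration and do not come from the benchmark.

```python
def compatible(a, b):
    """Two bindings are compatible if they agree on every shared variable."""
    return all(b[var] == value for var, value in a.items() if var in b)

def left_join(required, optional):
    """Keep every required binding, extended by optional ones where possible."""
    results = []
    for r in required:
        matches = [{**r, **o} for o in optional if compatible(r, o)]
        # If nothing matches, the required binding survives unextended.
        results.extend(matches or [r])
    return results

# Bindings for the required pattern: ?artifact rdf:type opmv:Artifact
required = [{"?artifact": "utpb:schema"}, {"?artifact": "utpb:dataset"}]
# Bindings for one OPTIONAL group; the value is invented for this example.
optional = [{"?artifact": "utpb:dataset", "?value": "42"}]

print(left_join(required, optional))
```

Here `utpb:schema` appears in the result without a `?value`, which is the "if any" part of the query description.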

Performance Study
- Please see other queries in the paper
- Very efficient and scalable (nearly constant scalability due to minimal data transfers and fast index-based join processing)

Related Work
- HBase, BigTable, Cassandra
- Hadoop, Hive, Pig, CouchDB, MongoDB, etc.
- NoSQL solutions to RDF data management
- Provenance management systems
- RDF data indexing

Summary and Future Work
- Designed novel storage and indexing schemes for RDF data in HBase that are suitable for provenance datasets
- Empirical evaluation results are promising
- Future work
  - Compare, compare, compare
  - More experiments with multi-user workloads
  - More optimizations
  - PROV-DM benchmark anyone?

THANK YOU! Questions?
- My contact information: Artem Chebotko, Department of Computer Science, University of Texas – Pan American