Managing Large RDF Graphs (Infinite Graph) Vaibhav Khadilkar Department of Computer Science, The University of Texas at Dallas FEARLESS engineering.

Presentation transcript:

Agenda
- Motivation behind the project
- Semantic web technologies overview
- Proposed architecture
- Performance metrics

Motivation - Current Problems
- Jena’s in-memory model does not scale
- Jena’s RDB and SDB models cannot handle large result sets
- Hinders ability to do reasoning and large graph processing
- Current work focuses on load balancing and fault tolerance
- Current systems can be broken with even 100,000 triples
- We work on load balancing and polynomial reasoning, but memory management breaks systems before any other problems can be addressed

Motivation - Relevance of the problem
- This is an unsolved problem
- Critical in handling the terabytes of data that are relevant today
- Move the problem from memory space to disk space

[Diagram: Jena, In-memory, RDB, SDB, ARQ, Extension, Reasoning]

Semantic web technologies overview - Jena
- Jena is a Java-based framework for building Semantic Web applications
- Jena provides a programmatic environment for RDF, RDFS, OWL and SPARQL, and includes a rule-based inference engine
- Jena allows the creation and manipulation of in-memory or relational-database-backed (RDB and SDB) RDF graphs

Semantic web technologies overview - Lucene
- Lucene is a Java-based text indexing and searching tool
- The smallest unit of text that Lucene indexes and searches is a Document
- A Document contains different fields and a corresponding value for each field
- The different fields are the indexes that can be used as keywords during a search
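The Document/field idea can be illustrated with a toy index. This is a minimal sketch in plain Python, not the Lucene API: the class name ToyFieldIndex and the field names subject, predicate and object are assumptions chosen to match how a triple might be stored as one document.

```python
from collections import defaultdict


class ToyFieldIndex:
    """Toy stand-in for a field-based index (hypothetical, not Lucene)."""

    def __init__(self):
        self.docs = []  # doc id -> {field: value}
        # field name -> field value -> set of doc ids
        self.index = defaultdict(lambda: defaultdict(set))

    def add(self, doc):
        """Store a document and index every one of its fields."""
        doc_id = len(self.docs)
        self.docs.append(dict(doc))
        for field, value in doc.items():
            self.index[field][value].add(doc_id)
        return doc_id

    def search(self, field, value):
        """Return all documents whose given field has exactly this value."""
        ids = self.index.get(field, {}).get(value, set())
        return [self.docs[i] for i in sorted(ids)]


idx = ToyFieldIndex()
idx.add({"subject": "ex:alice", "predicate": "foaf:knows", "object": "ex:bob"})
idx.add({"subject": "ex:bob", "predicate": "foaf:knows", "object": "ex:carol"})
hits = idx.search("subject", "ex:bob")
```

Here each triple becomes one document, and any of its three fields can serve as the search key.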

Problems with In-memory Jena Model
- Able to handle medium-sized graphs
- As nodes are added, memory fills up
- As more nodes are added, the program crashes with an out-of-memory exception
- We want to solve this out-of-memory problem

Buffer Management Strategy
[Diagram: in-memory triple store + buffer, Lucene triple store]
1. Add triples
2. Added triples = Threshold
3. Buffer sorted based on memory management algorithm
4. Write triples based on sorted buffer while triples left > x of Threshold
5. Continue adding triples
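The buffer management steps can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: a Python set stands in for the Lucene triple store, x is read as a fraction of the threshold (keep_fraction, an assumption), and the names SpillingTripleStore and fifo are hypothetical. FIFO is used only because it is one of the memory management algorithms the slides mention.

```python
class SpillingTripleStore:
    """Keeps triples in memory and spills the lowest-priority ones to 'disk'."""

    def __init__(self, threshold, keep_fraction, priority):
        self.threshold = threshold                   # spill when memory reaches this size
        self.keep = int(threshold * keep_fraction)   # size to shrink back to
        self.priority = priority                     # lower value = written out first
        self.memory = {}                             # triple -> insertion tick
        self.disk = set()                            # stand-in for the Lucene triple store
        self.tick = 0

    def add(self, triple):
        # Steps 1 and 2: add the triple, then check against the threshold.
        self.tick += 1
        self.memory[triple] = self.tick
        if len(self.memory) >= self.threshold:
            self._spill()

    def _spill(self):
        # Step 3: sort the buffer with the chosen memory management algorithm.
        buffer = sorted(self.memory, key=lambda t: self.priority(t, self.memory[t]))
        # Step 4: write triples out while more than the kept share remains.
        for victim in buffer:
            if len(self.memory) <= self.keep:
                break
            del self.memory[victim]
            self.disk.add(victim)


def fifo(triple, tick):
    """First in, first out: the oldest triple is written to disk first."""
    return tick


store = SpillingTripleStore(threshold=4, keep_fraction=0.5, priority=fifo)
triples = [(f"ex:s{i}", "ex:p", f"ex:o{i}") for i in range(5)]
for t in triples:
    store.add(t)  # step 5: keep adding after each spill
```

Swapping fifo for another scoring function changes which triples leave memory without touching the rest of the loop.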

[Diagram: in-memory triple store, Lucene triple store]
1. Query model
2. If result not in memory, query Lucene triple store
3. Return result
4. Return result
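The query path can be sketched as a memory lookup with a fallback to the disk store. This is a simplified illustration under stated assumptions: both stores are plain Python lists, the lookup is by subject only, and the function name query_by_subject is hypothetical. A real system would also have to handle a subject whose triples are split across both stores, which this sketch ignores.

```python
def query_by_subject(memory, disk, subject):
    """Return matching triples and the name of the store that answered."""
    # Step 1: query the in-memory model first.
    hits = [t for t in memory if t[0] == subject]
    if hits:
        return hits, "memory"
    # Step 2: nothing in memory, so fall back to the disk-backed store.
    return [t for t in disk if t[0] == subject], "disk"


memory = [("ex:a", "ex:p", "ex:b")]
disk = [("ex:c", "ex:p", "ex:d")]
from_memory = query_by_subject(memory, disk, "ex:a")
from_disk = query_by_subject(memory, disk, "ex:c")
```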

Choice of Algorithm
- Memory management algorithms such as LRU, MRU, FIFO, and LIFO
- Social network analysis measures such as degree centrality and individual clustering coefficient
- Combination of memory management algorithm with degree centrality and individual clustering coefficient
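For reference, the two social network measures can be computed from a set of triples as below. This sketch uses the standard textbook definitions and treats each triple as an undirected edge between subject and object; that graph view, and the function names, are assumptions. The slides do not say whether high-scoring or low-scoring nodes are written to disk first, so the sketch only computes the scores.

```python
from collections import defaultdict
from itertools import combinations


def neighbours(triples):
    """Build an undirected adjacency map from (subject, predicate, object) triples."""
    adj = defaultdict(set)
    for s, _p, o in triples:
        if s != o:
            adj[s].add(o)
            adj[o].add(s)
    return adj


def degree_centrality(adj):
    """Degree of each node divided by the maximum possible degree (n - 1)."""
    n = len(adj)
    if n < 2:
        return {node: 0.0 for node in adj}
    return {node: len(nbrs) / (n - 1) for node, nbrs in adj.items()}


def clustering_coefficient(adj, node):
    """Fraction of pairs of a node's neighbours that are themselves connected."""
    nbrs = adj.get(node, set())
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return 2 * links / (k * (k - 1))


adj = neighbours([
    ("ex:a", "foaf:knows", "ex:b"),
    ("ex:a", "foaf:knows", "ex:c"),
    ("ex:b", "foaf:knows", "ex:c"),
    ("ex:a", "foaf:knows", "ex:d"),
])
centrality = degree_centrality(adj)
```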

Choice of buffer and persistence strategy
- Buffer can be created based on the subject, predicate, object or a combination of them
- Map Jena’s subject, predicate and object indexes to Lucene indexes directly
- Create Lucene indexes as needed, taking into account the nature of SPARQL queries and Jena’s implementation
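The "create indexes as needed" option can be illustrated with a store that builds an index for a triple position only the first time a query asks for it. This is a hedged sketch in plain Python: the class LazyIndexes and its dictionary-based indexes are hypothetical and stand in for Lucene indexes, and the way the original system decides when to build an index is not described in the slides.

```python
class LazyIndexes:
    """Builds a per-position index on first use instead of all three up front."""

    POSITIONS = {"subject": 0, "predicate": 1, "object": 2}

    def __init__(self, triples):
        self.triples = list(triples)
        self.built = {}  # position name -> {value: [triples]}

    def lookup(self, position, value):
        if position not in self.built:
            # First query on this position: build its index now.
            col = self.POSITIONS[position]
            index = {}
            for t in self.triples:
                index.setdefault(t[col], []).append(t)
            self.built[position] = index
        return self.built[position].get(value, [])


lazy = LazyIndexes([
    ("ex:a", "foaf:knows", "ex:b"),
    ("ex:b", "foaf:name", "Bob"),
])
known = lazy.lookup("predicate", "foaf:knows")
```

After this single query only the predicate index exists; the subject and object indexes cost nothing until a query needs them.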

Conclusions from the in-memory model
- Degree centrality is the best algorithm to choose a node to be persisted to disk
- Creating Lucene indexes as needed is a better choice for the persistence strategy than creating all indexes at the same time

Problems with RDB Jena model
- The RDB Jena model can add any number of triples to the relational database
- When a query asking for a large number of triples is executed, the returned result set fills up memory, causing the program to crash with an out-of-memory exception
- We want to solve this out-of-memory problem
- We leverage the previous in-memory extension to solve this problem

Memory management algorithm
- Algorithm
  - We use the LIMIT and OFFSET clauses in SQL to get only a part of the results at a time
  - The retrieved triples are added to the extended in-memory Jena model
  - Thus we use the memory management algorithm from the in-memory model
  - Since the revised in-memory model never runs out of memory, this RDB solution never runs out of memory
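The LIMIT/OFFSET paging can be sketched with SQLite from the Python standard library. The table name triples and its columns s, p, o are assumptions for illustration (Jena's RDB layout is different), and add_to_model is a hypothetical callback standing in for the extended in-memory model.

```python
import sqlite3


def load_in_pages(conn, page_size, add_to_model):
    """Fetch the result in pages so only one page is held in memory at a time."""
    offset = 0
    while True:
        rows = conn.execute(
            "SELECT s, p, o FROM triples ORDER BY rowid LIMIT ? OFFSET ?",
            (page_size, offset),
        ).fetchall()
        if not rows:
            break
        for row in rows:
            add_to_model(row)  # the model applies its own memory management
        offset += len(rows)
    return offset  # total number of triples transferred


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [(f"ex:s{i}", "ex:p", f"ex:o{i}") for i in range(7)],
)
model = []
total = load_in_pages(conn, 3, model.append)
```

The ORDER BY clause matters: without a stable order, consecutive pages are not guaranteed to be disjoint.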

Conclusions
- Conclusions from the extended RDB model
  - Model creation times are similar to the original RDB Jena model
  - Query times vary based on the threshold value in the in-memory solution
- General conclusions
  - Implemented an in-memory cache based memory management algorithm
  - Solves the memory problem for the in-memory and RDB Jena models by creating an impression of infinite memory for the user
  - Moves the memory problem to disk space

Problems with SDB Jena Model
- The SDB Jena model can add any number of triples to the relational database
- When a query asking for a large number of triples is executed, the returned result set fills up memory, causing the program to crash with an out-of-memory exception
- We want to solve this out-of-memory problem
- The SDB solution does not depend on the in-memory or RDB extensions

Memory management algorithm
- Algorithm
  - We use the LIMIT and OFFSET clauses in SQL to get only a part of the results at a time
  - The retrieved triples are returned as a separate iterator to the executing program
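Returning paged results as an iterator can be sketched with a Python generator. The function paged_iterator and the fetch_page(limit, offset) callable are hypothetical; in the system described, fetch_page would issue the SQL query with LIMIT and OFFSET, whereas here a list slice stands in for the database.

```python
def paged_iterator(fetch_page, page_size):
    """Yield triples one at a time, fetching the next page only when needed."""
    offset = 0
    while True:
        page = fetch_page(page_size, offset)
        if not page:
            return
        yield from page
        offset += len(page)


data = [(f"ex:s{i}", "ex:p", f"ex:o{i}") for i in range(5)]
calls = []


def fetch_page(limit, offset):
    """Stand-in for a LIMIT/OFFSET query; records each call for inspection."""
    calls.append((limit, offset))
    return data[offset:offset + limit]


it = paged_iterator(fetch_page, 2)
first = next(it)          # only the first page has been fetched so far
calls_after_first = list(calls)
rest = list(it)           # draining the iterator fetches the remaining pages
```

The caller sees a single stream of triples and never holds more than one page at once.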

Inferencing in Semantic Web
- Ontology specification - TBox
- Instance creation - ABox
- Inference - Generating new triples based on instances in the ABox, backed by the TBox
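A small worked example of generating new triples from ABox instances backed by a TBox: the sketch below applies only the RDFS subclass rule (an instance of a class is also an instance of its superclasses). It covers a tiny part of what a full reasoner does, and the function name infer_types and the example vocabulary are assumptions.

```python
def infer_types(tbox, abox):
    """Derive rdf:type triples implied by rdfs:subClassOf axioms in the TBox."""
    parents = {}
    for s, p, o in tbox:
        if p == "rdfs:subClassOf":
            parents.setdefault(s, set()).add(o)

    inferred = set()
    for s, p, o in abox:
        if p != "rdf:type":
            continue
        # Walk up the class hierarchy, guarding against cycles.
        stack = list(parents.get(o, ()))
        seen = set()
        while stack:
            cls = stack.pop()
            if cls in seen:
                continue
            seen.add(cls)
            inferred.add((s, "rdf:type", cls))
            stack.extend(parents.get(cls, ()))
    return inferred - set(abox)  # report only triples that are new


tbox = [
    ("ex:Student", "rdfs:subClassOf", "ex:Person"),
    ("ex:Person", "rdfs:subClassOf", "ex:Agent"),
]
abox = [("ex:alice", "rdf:type", "ex:Student")]
new_triples = infer_types(tbox, abox)
```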

Problems in inferencing with this extension
- How do you do reasoning when the graph is divided between memory and disk?
- Scalability

Buffer Management Strategy
[Diagram: triple store, in-memory triple store, in-memory triple store + buffer, Lucene triple store]
- Add triples
- Is triple a part of TBox? (Yes / No)
1. Added triples = Threshold
2. Buffer sorted based on memory management algorithm
3. Write triples based on sorted buffer while triples left > x of Threshold
- Continue adding triples
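The TBox check can be sketched as a routing step in front of the buffer. Two assumptions are made here that the slides do not confirm: a triple is treated as TBox when its predicate belongs to a small set of schema predicates (SCHEMA_PREDICATES, a hypothetical heuristic), and TBox triples are taken to stay in their own in-memory store while ABox triples go through the buffer.

```python
# Hypothetical heuristic: schema-level predicates mark a triple as TBox.
SCHEMA_PREDICATES = {
    "rdfs:subClassOf",
    "rdfs:subPropertyOf",
    "rdfs:domain",
    "rdfs:range",
}


def is_tbox(triple):
    """Return True if the triple's predicate is a schema predicate."""
    return triple[1] in SCHEMA_PREDICATES


def route(triples):
    """Split incoming triples into TBox (kept in memory) and ABox (buffered)."""
    tbox, abox = [], []
    for t in triples:
        (tbox if is_tbox(t) else abox).append(t)
    return tbox, abox


tbox_part, abox_part = route([
    ("ex:Student", "rdfs:subClassOf", "ex:Person"),
    ("ex:alice", "rdf:type", "ex:Student"),
    ("ex:alice", "foaf:knows", "ex:bob"),
])
```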

[Diagram: Pellet Reasoner, in-memory triple store, in-memory triple store, Lucene triple store]
1. Query
2. Get TBox triples
3. Query for ABox triples
4. If result not in memory, query Lucene triple store
5. Return result
6. Return result
7. Return result

Choice of Algorithm
- Memory management algorithms such as LRU, MRU, FIFO, and LIFO
- Social network analysis measures such as degree centrality and individual clustering coefficient
- Combination of memory management algorithm with degree centrality and individual clustering coefficient

Choice of buffer and persistence strategy
- Buffer can be created based on the subject, predicate, object or a combination of them
- Map Jena’s subject, predicate and object indexes to Lucene indexes directly
- Create Lucene indexes as needed, taking into account the nature of SPARQL queries and Jena’s implementation

Conclusions from the inference model
- RANDOM is the best algorithm to choose a node to be persisted to disk
- Creating all Lucene indexes at the same time is a better choice for the persistence strategy than creating the indexes one at a time

Future Work
- Test all models with benchmark data
- Generalize the algorithm to be able to handle multiple incarnations of nodes over time
- Improve the efficiency of all algorithms
- Try other algorithms for selecting the candidate node to be written to disk