CASS-MT Review: 6-Apr-2011 Task 3: Semantic Databases on the XMT PNNL:David Haglin, Bob Adolf, Sinan al-Saffar, Cliff Joslyn Cray: David Mizell SNL: Eric.

CASS-MT Review: 6-Apr-2011 Task 3: Semantic Databases on the XMT PNNL:David Haglin, Bob Adolf, Sinan al-Saffar, Cliff Joslyn Cray: David Mizell SNL: Eric Goodman, Edward Jimenez, Greg Mackey 1

HPC applied to Semantic Graph Databases 2 Result Set Expressing queries as a graph SPARQL SGD as an Appliance (Front end) User Interface Billion Triple size datasets Extant Ontological Scaling Motif Analysis Analysis Dictionary Encoding Materialized Inference Paging graph portions / dictionary Data Storage & Manipulation Search Processing Approach Query Optimization On-the-fly Inferencing Search / Query

Outline Introduction (David Haglin) Accomplishments Focus this review: Query Search Process OWL Rules, Subgraph Isomorphism, Sprinkle-SPARQL (Eric Goodman) Generic Forward-Inferencing Capability(David Mizell) Graph Analysis and Extant Ontology (Sinan al-Saffar) What next? (David Haglin)

Accomplishments Accepted Papers: Eric Goodman, Edward Jimenez, David Mizell, Sinan al-Saffar, Bob Adolf, and David Haglin. “High-performance Computing Applied to Semantic Databases”. Extended Semantic Web Conference (ESWC 2011), May 2011. (23% acceptance rate) Submissions: Cliff Joslyn, Bob Adolf, Sinan al-Saffar, John Feo, Eric Goodman, David Haglin, Greg Mackey, and David Mizell. “High Performance Descriptive Semantic Analysis of Semantic Graph Databases”. Workshop on High-Performance Computing for the Semantic Web, ESWC 2011, May 2011. Sinan al-Saffar, Cliff Joslyn, Alan Chappell. “Extant Ontological Scaling and Descriptive Semantics for Semantic Structure Discovery in Large Graph Datasets.” IEEE/WIC/ACM International Conference on Web Intelligence. Workshops Organized: HPCSW – Most of task 3 personnel on program committee. Complex Query Workshop – scheduled for April 25/26 in Seattle, WA Hybrid Database Planning Technical Meeting: Battelle Seattle Research Center, February 2011 UW (Howe, Shaw), PNNL (CASS/SDB and TAI), SNL

Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000. CASS-MT Quarterly Review Task 3: Semantic Databases on the XMT Eric Goodman Edward Jimenez Greg Mackey Update April 2011

Sprinkle SPARQL Sprinkle SPARQL presented in ESWC paper Paucity of scalability results in literature –10 nodes running MapReduce –1 node running BigOWLIM Note: MapReduce method did not operate on inferred set. They hand-encoded expanded queries to catch the possibilities.

LUBM Query 1 SELECT ?X WHERE {?X rdf:type ub:GraduateStudent} {?X ub:takesCourse http:www.Department0.University0.edu/GraduateCourse0} All the Graduate Students All the Students that took a particular course 20,157,119 matches 4 matches

Sprinkle phase 00000000000000000000 Create an array the same size as the order of the graph for each variable in each BGP Process each BGP –If node fulfills constraint of BGP, increment counter in associated array for the variable The point: Constrain the problem before we start joining

Sprinkle phase 0 All the Students that took a particular course 1000000000000100000

Sprinkle phase 1 All the Students that took a particular course 2010110010110201011 All the Graduate Students

Future Query Work Spinkle-SPARQL –In-depth analysis –More discriminating use of Sprinkle Comparison to other approaches –MTGL subgraph isomorphism algorithm –Approach from Bob Adolf and David Haglin –Array-based method from David Mizell for SC10 demo

Inference Work Multimap data structure OWL Horst rules –rdfp4: Transitivity –rdfp8: InverseOf –rdfp12: Equivalent Classes –rdfp15: SomeValuesFrom –These are the set of rules required for LUBM

Multimaps A mapping between keys and multiple values Comes up often in RDFS/OWL inferencing –Class hierarchies –Property hierarchies –SameAs relationships –Indices to find triples with certain subjects, predicates, or objects

Multimap: First Loop 1 1 2 3 1 1 4 1 Keys Index Counter Key External to Class Inside Multimap Class 0 0 0 0 0 0 0 0 0 0 0 0 1 1 2 21 3 1 0 1 0 0

Multimap: First Loop 1 1 2 3 1 1 4 1 Keys Index Counter Key External to Class Inside Multimap Class 0 1 0 0 0 1 0 0 0 1 5 0 1 2 3 0 1 0 0 4 2 3 0 4

Multimap: Initialize Storage 1 1 2 3 1 1 4 1 Keys Index Counter Key External to Class Inside Multimap Class 0 1 0 0 0 1 0 0 0 1 5 0 1 2 3 0 1 0 0 4 2 3 0 4 Values

Results Data set: ~5B Zipfian Integers –Value was “1” for each key Total time at 128 Processors –Old Method: 23.5 seconds 208e6 inserts/second –New Method: 11.5 seconds 422e6 inserts/second Comparison to hashing –5.5 seconds –878e6 inserts/second Speedup from 2 to 128 (ideal 64x) –Old: 37x –New: 53x Note: Had to grab class member variables and pass them back in to get good scaling.

OWL Horst Preliminary Results Aproximate Inference Rate on 64 Procs Urbani (with IO)225,000 – 340,000 Mizell and Rickett (no IO) ~1,000,000 RDFS (with IO)5,800,000 rdfp4 (no IO)3,000,000 rdfp8 (no IO)59,000,000 rdfp12ab (no IO)37,000,000 rdfp15 (no IO)8,700,000

Future Inference Work Compare with Chris Rickett and David Mizell’s strategy Prepare submission for ISWC 2011 (June deadline) Move to on-the-fly inference

Develop an automated or semi-automated process for extracting the ontology from an RDF triples database translating the ontological rules into a simple syntax, eg Jena Rules using the translation to perform forward (later backward) inferencing on the database

… ( ?x is-a Cray-employee ) -> ( ?x has-a cell-phone ) … ( Cray-employee subset-of US-citizen ) … ( David is-a Cray-employee ) … ( Shoaib is-a Cray-employee ) … New, inferred triples: Ontology rules Triples database ( David has-a cell-phone ) ( Shoaib has-a cell-phone ) ( David is-a US-citizen ) ( Shoaib is-a US-citizen ) Get applied to… …(also called “materialization”)

Take each rule ( ?x is-a Cray-employee ) -> ( ?x has-a cell-phone ) Search the database for triples that match the left-hand side of the rule ( ?x is-a Cray-employee ) ( David is-a Cray-employee ) Add the new triple(s) to the database corresponding to the right-hand side ( David has-a cell-phone ) (worst case) repeat until you reach a fixed-point

( ?x is-a Cray-employee ) && ( ?x is-a manager ) -> ( ?x has-a Blackberry ) ( Shoaib is-a Cray-employee ) … ( Shoaib is-a manager ) … JOIN

Goodman and Mizell, “Scalable In-Memory Closure on Billions of Triples,” International Workshop on Scalable Semantic Web Knowledge Bases, at the International Semantic Web Conference, Shanghai, Nov. 2010 RDFS is a standard ontology with 13 rules. 6 of these have 2 triple patterns on the left-hand side (require join-like processing). We only used those. Wrote 6 functions with the same overall structure: Search the database for matches to the left-hand side Add the implied triples Eric cleverly scheduled the application of these functions to avoid fixpoint iteration

Castagna, Dollin and Seaborne, “Vivisecting LUBM,” HP Laboratories, HPL-2009-348, Nov. 6, 2009 What the HP Labs researchers did: Extracted the LUBM ontology rules Re-wrote them in “Jena Rules” format Applied them in “streaming” fashion to the LUBM database :Chair A owl:Class ; rdfs:label "chair" ; rdfs:subClassOf :Professor ; owl:intersectionOf ( :Person [ a owl:Restriction ; owl:onProperty :headOf ; owl:someValuesFrom :Department ]) (?x rdf:type ub:Chair) -> (?x rdf:type ub:Professor). (?x rdf:type ub:Person) (?x ub:headOf ?y) (?y rdf:type ub:Department)-> (?x rdf:type ub:Chair). (?x rdf:type ub:Chair) -> exists ?y : (?x rdf:type ub:Person) (?x ub:headOf ?y) (?y rdf:type ub:Department).

Grabbed their Jena-formatted rules from the paper’s appendix Chris wrote a parser for the rules, converted them to triples-pattern (integer) data structure (using Eric Goodman’s “dictionary”) Iterated through the rules until no new triples were added Recently, I tuned the inferencer by substituting a hash table specialized to integer triples (written by Eric Goodman) – used for duplicate elimination Time on LUBM8000, 1.1B triples before, 1.7B after (just inferencing, no I/O): 350 sec/128p; 185 sec/256p 148 sec/512p (?x rdf:type ub:Course) -> (?x rdf:type ub:Work). (?x rdf:type ub:Research) -> (?x rdf:type ub:Work). (?x rdf:type ub:GraduateCourse) -> (?x rdf:type ub:Course) (?x rdf:type ub:Work). (?x rdf:type ub:UndergraduateStudent) -> (?x rdf:type ub:Student). (?x rdf:type ub:ResearchAssistant) -> (?x rdf:type ub:Student). (?x rdf:type ub:GraduateStudent) -> (?x rdf:type ub:Person). (?x rdf:type ub:Faculty) -> (?x rdf:type ub:Employee). (?x rdf:type ub:Professor) -> (?x rdf:type ub:Faculty) (?x rdf:type ub:Employee). (?x rdf:type ub:AssistantProfessor) -> (?x rdf:type ub:Professor) (?x rdf:type ub:Faculty) (?x rdf:type ub:Employee). (?x rdf:type ub:AssociateProfessor) -> (?x rdf:type ub:Professor) (?x rdf:type ub:Faculty) (?x rdf:type ub:Employee). (?x rdf:type ub:Dean) -> (?x rdf:type ub:Professor) (?x rdf:type ub:Faculty) (?x rdf:type ub:Employee). (?x rdf:type ub:FullProfessor) -> (?x rdf:type ub:Professor) (?x rdf:type ub:Faculty) (?x rdf:type ub:Employee). (?x rdf:type ub:Chair) -> (?x rdf:type ub:Professor) (?x rdf:type ub:Faculty) (?x rdf:type ub:Employee). …

How does this performance compare to the specific function-per-rule approach? Is there a programmer time vs. execution time tradeoff? How generalizable is this “generic” approach? Jena Rules are easy to parse, but… Semantics can be quite tricky Usually will have to combine some custom, database-specialized rules with a standard ontology such as RDFS OWL Lite OWL DL … What we learn from this may help us with on-the-fly (backwards) inferencing in the future

Graph Analysis and Extant Ontology 31 Sinan al-Saffar, Cliff Joslyn

Informing the Design of a Future Database Engine Similarly to relational databases, in order to optimize any future graph database engine, we need to understand: Graph Content and Structure Queries and Inference Why? Because these influence the data structure and algorithms of choice to achieve efficient time and space utilization This has to happen in both: The overall design, And as a dynamic query optimization component

Graph-O-Scope We built a set of functions that compute statistical measures to help us understand the contents of semantic graphs The intention is to re-implement these functions in an API that is to be used from within a dynamic query optimization module Some of the Statistics: Edge and nodes counts and graph density Literal, blank, and URI counts with break-downs by subjects/object Predicate and class distributions Counts of reification and ontological components In-degree / out-degree dist Connected component sameAs cliques

What is in the graph? Question: How can we “understand” a 2 billion edge graph? We looked at three large datasets: BTC is a result of a semantic web crawl Uniprot is a ten year, primary bioinformatics reference LUBM is a synthetic dataset Dataset # Edges BTC2010 1.4 b UNIPROT2.04 b LUBM8K1.07 b

Reification OriginalReification BTC20101.4b24.21m UNIPROT2.04b554.86m LUBM8K1.07b0 Discovery: A good chunk of the data is refied Design: Make database a hybrid Primary Statement Annotation

Terminal Edges Discovery: Literal nodes and edge constitute a good size of the data Design: Implement literals as node properties (outside the graph) OriginalAfter Removing Literals BTC2010 #edges #nodes 1.4 b 281 m 0.53 b 221 m UNIPROT #edges #nodes 2.04 b 461 m 1.4 b 404 m LUBM8K #edges #nodes 1.07 b 263 m 0.71 b 174 m

Class Coverage (BTC) Discovery: 168k classes but 16 cover 80%, 64 cover 95% of the data Design: Implement types as node property (huge effect on inference)

Predicate Coverage (BTC) Discovery: 95k different predicates but 64 cover 86% of the data Design: optimize graph data structure for a small range of edge labels

UniProt Extant Ontology I A 243-edge graph as a statistical representation of the present semantic structures in 2b-edge Uniprot graph

Uniprot Extant Ontology I zoomed-in

Extant Ontological Scaling I

Extant Ontological Scaling II

Extant Ontological Scaling III

Level 1 Extant of Uniprot – Scaling applied rdfs:seeAlso 51.82%

Future Work: specific directions Continue working with Larry Holder (WSU) to find common ground on frequent subgraph mining and semantic database query Work with Bill Howe on query language and hybrid search strategies Expand our collaboration with Task 1. Support Task 16 (Mayo) Engage with Bioinformatics domain to find/build interestingly large and complex Bio dataset (i.e., more complex than uniprot) Find collections of complex queries Continue work on search engine comparison: Array-based Subgraph-isomorphism (MTGL) Sprinkle-SPARQL Explore query optimization strategies Extend study of larger path types (n=4,5) and/or non-linear motifs

CASS-MT Review: 6-Apr-2011 Task 3: Semantic Databases on the XMT PNNL:David Haglin, Bob Adolf, Sinan al-Saffar, Cliff Joslyn Cray: David Mizell SNL: Eric.

Similar presentations

Presentation on theme: "CASS-MT Review: 6-Apr-2011 Task 3: Semantic Databases on the XMT PNNL:David Haglin, Bob Adolf, Sinan al-Saffar, Cliff Joslyn Cray: David Mizell SNL: Eric."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

CASS-MT Review: 6-Apr-2011 Task 3: Semantic Databases on the XMT PNNL:David Haglin, Bob Adolf, Sinan al-Saffar, Cliff Joslyn Cray: David Mizell SNL: Eric.

Similar presentations

Presentation on theme: "CASS-MT Review: 6-Apr-2011 Task 3: Semantic Databases on the XMT PNNL:David Haglin, Bob Adolf, Sinan al-Saffar, Cliff Joslyn Cray: David Mizell SNL: Eric."— Presentation transcript:

Similar presentations

About project

Feedback