Implementing Iterative Algorithms with SPARQL Robert Techentin, Barry Gilbert, Mayo Clinic Adam Lugowski, Kevin Deweese, John Gilbert, University of California.

Slides:



Advertisements
Similar presentations
1. 2  RDF triples database – memory-resident  SPARQL query language  Aimed at customers who Have large datasets Want to do graph analytics.
Advertisements

Google News Personalization Scalable Online Collaborative Filtering
Register Allocation Consists of two parts: Goal : minimize spills
Problems and Their Classes
K-MEANS ALGORITHM Jelena Vukovic 53/07
Overview Structural Testing Introduction – General Concepts
CrowdER - Crowdsourcing Entity Resolution
Oracle Labs Graph Analytics Research Hassan Chafi Sr. Research Manager Oracle Labs Graph-TA 2/21/2014.
www.brainybetty.com1 MAVisto A tool for the exploration of network motifs By Guo Chuan & Shi Jiayi.
Efficient access to TIN Regular square grid TIN Efficient access to TIN Let q := (x, y) be a point. We want to estimate an elevation at a point q: 1. should.
Multi-label Relational Neighbor Classification using Social Context Features Xi Wang and Gita Sukthankar Department of EECS University of Central Florida.
Pete Bohman Adam Kunk.  Introduction  Related Work  System Overview  Indexing Scheme  Ranking  Evaluation  Conclusion.
Exact Inference in Bayes Nets
Activity relationship analysis
DSPIN: Detecting Automatically Spun Content on the Web Qing Zhang, David Y. Wang, Geoffrey M. Voelker University of California, San Diego 1.
Train DEPOT PROBLEM USING PERMUTATION GRAPHS
Using Structure Indices for Efficient Approximation of Network Properties Matthew J. Rattigan, Marc Maier, and David Jensen University of Massachusetts.
Improving code generation. Better code generation requires greater context Over expressions: optimal ordering of subtrees Over basic blocks: Common subexpression.
CS 536 Spring Intermediate Code. Local Optimizations. Lecture 22.
Recommender systems Ram Akella February 23, 2011 Lecture 6b, i290 & 280I University of California at Berkeley Silicon Valley Center/SC.
(Page 554 – 564) Ping Perez CS 147 Summer 2001 Alternative Parallel Architectures  Dataflow  Systolic arrays  Neural networks.
Recommender systems Ram Akella November 26 th 2008.
Improving Code Generation Honors Compilers April 16 th 2002.
Copyright © Cengage Learning. All rights reserved. CHAPTER 11 ANALYSIS OF ALGORITHM EFFICIENCY ANALYSIS OF ALGORITHM EFFICIENCY.
Lecture 11. Matching A set of edges which do not share a vertex is a matching. Application: Wireless Networks may consist of nodes with single radios,
1 CS101 Introduction to Computing Lecture 19 Programming Languages.
Managing Large RDF Graphs (Infinite Graph) Vaibhav Khadilkar Department of Computer Science, The University of Texas at Dallas FEARLESS engineering.
Pregel: A System for Large-Scale Graph Processing
Topic #10: Optimization EE 456 – Compiling Techniques Prof. Carl Sable Fall 2003.
Teaching Deliberative Navigation Using the LEGO RCX and Standard LEGO Components Gary R. Mayer, Dr. Jerry Weinberg, Dr. Xudong Yu
Knowledge Discovery Toolbox kdt.sourceforge.net Adam Lugowski.
Ahsanul Haque *, Swarup Chandra *, Latifur Khan * and Charu Aggarwal + * Department of Computer Science, University of Texas at Dallas + IBM T. J. Watson.
Research Directions for Big Data Graph Analytics John A. Miller, Lakshmish Ramaswamy, Krys J. Kochut and Arash Fard Department of Computer Science University.
Pregel: A System for Large-Scale Graph Processing Presented by Dylan Davis Authors: Grzegorz Malewicz, Matthew H. Austern, Aart J.C. Bik, James C. Dehnert,
Community Evolution in Dynamic Multi-Mode Networks Lei Tang, Huan Liu Jianping Zhang Zohreh Nazeri Danesh Zandi & Afshin Rahmany Spring 12SRBIAU, Kurdistan.
Automatic interpretation of salt geobodies Adam Halpert ExxonMobil CEES Visit 12 November 2010 Stanford Exploration Project.
Ahsanul Haque *, Swarup Chandra *, Latifur Khan * and Michael Baron + * Department of Computer Science, University of Texas at Dallas + Department of Mathematical.
CS101 Introduction to Computing Lecture Programming Languages.
Pregel: A System for Large-Scale Graph Processing Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and.
ANALYSIS AND IMPLEMENTATION OF GRAPH COLORING ALGORITHMS FOR REGISTER ALLOCATION By, Sumeeth K. C Vasanth K.
1/24 Introduction to Graphs. 2/24 Graph Definition Graph : consists of vertices and edges. Each edge must start and end at a vertex. Graph G = (V, E)
Convergence of PageRank and HITS Algorithms Victor Boyarshinov Eric Anderson 12/5/02.
A System to Generate Test Data and Symbolically Execute Programs Lori A. Clarke Presented by: Xia Cheng.
CSCI 115 Chapter 8 Topics in Graph Theory. CSCI 115 §8.1 Graphs.
Partitioning using Mesh Adjacencies  Graph-based dynamic balancing Parallel construction and balancing of standard partition graph with small cuts takes.
RDFPath: Path Query Processing on Large RDF Graph with MapReduce Martin Przyjaciel-Zablocki et al. University of Freiburg ESWC May 2013 SNU IDB.
Chapter 4 Grouping Objects. Flexible Sized Collections  When writing a program, we often need to be able to group objects into collections  It is typical.
Exact Inference in Bayes Nets. Notation U: set of nodes in a graph X i : random variable associated with node i π i : parents of node i Joint probability:
Data Structures and Algorithms in Parallel Computing
Pregel: A System for Large-Scale Graph Processing Nov 25 th 2013 Database Lab. Wonseok Choi.
1 Software Testing & Quality Assurance Lecture 13 Created by: Paulo Alencar Modified by: Frank Xu.
High Performance Embedded Computing © 2007 Elsevier Lecture 10: Code Generation Embedded Computing Systems Michael Schulte Based on slides and textbook.
CS Machine Learning Instance Based Learning (Adapted from various sources)
ALGORITHMS AND FLOWCHARTS. Why Algorithm is needed? 2 Computer Program ? Set of instructions to perform some specific task Is Program itself a Software.
Graph Data Management Lab, School of Computer Science Add title here: Large graph processing
James Hipp Senior, Clemson University.  Graph Representation G = (V, E) V = Set of Vertices E = Set of Edges  Adjacency Matrix  No Self-Inclusion (i.
Flow Control in Imperative Languages. Activity 1 What does the word: ‘Imperative’ mean? 5mins …having CONTROL and ORDER!
REX: RECURSIVE, DELTA-BASED DATA-CENTRIC COMPUTATION Yavuz MESTER Svilen R. Mihaylov, Zachary G. Ives, Sudipto Guha University of Pennsylvania.
Paper Presentation Social influence based clustering of heterogeneous information networks Qiwei Bao & Siqi Huang.
Data Summit 2016 H104: Building Hadoop Applications Abhik Roy Database Technologies - Experian LinkedIn Profile:
GUILLOU Frederic. Outline Introduction Motivations The basic recommendation system First phase : semantic similarities Second phase : communities Application.
Topics In Social Computing (67810) Module 1 (Structure) Centrality Measures, Graph Clustering Random Walks on Graphs.
Pagerank and Betweenness centrality on Big Taxi Trajectory Graph
CS101 Introduction to Computing Lecture 19 Programming Languages
Think What will be the output?
Spare Register Aware Prefetching for Graph Algorithms on GPUs
CPSC-310 Database Systems
CS 440 Database Management Systems
S.JOSEPHINE THERESA, DEPT OF CS, SJC, TRICHY-2
A framework for ontology Learning FROM Big Data
Presentation transcript:

Implementing Iterative Algorithms with SPARQL Robert Techentin, Barry Gilbert, Mayo Clinic Adam Lugowski, Kevin Deweese, John Gilbert, University of California Santa Barbara Eric Dull, Mike Hinchey, Steven P. Reinhardt, YarcData/Cray GraphQ 2014, Athens

Agenda YarcData/Cray’s interest in graphs and SPARQL Approach to using iterative algorithms with SPARQL Three implementations Lessons Future work

YarcData/Cray and Graphs YarcData develops and sells the Urika™ graph appliance – An in-memory graph store scalable to 512TB of RAM – Based on RDF/SPARQL standards – Shipping since Jan2012 – Typically working with 10-40B triples Target market is “discovery analytics” – Subject-matter expert (human) is in the loop – “Discovery” means a) need interactive response time (O(10s of seconds)) and b) queries are not known in advance

Iterative Algorithms Rich recent work in machine learning and related fields Want to reuse these algorithms on graphs, in RDF/SPARQL Many important algorithms are iterative (e.g., PageRank, Markov clustering, SVD) SPARQL has no mechanism for iteration – And using nested queries appears insufficient

Implementation Approach Overhead of a SPARQL query is small enough to implement via multiple queries Resulting structure establish initial state do { update state measure convergence criteria } while (convergence not met) establish final state and clean up Intermediate state is typically large, so want it to reside in the DB, hence need to INSERT it

Algorithms Implemented Peer-pressure clustering (UCSB, YD) – Clusters vertices by propagating the most popular cluster of each vertex’s neighbors Graph diffusion (Mayo, YD ) – Models natural transport process, using the connectivity of the graph – Can quantify which connections were most important to transportation Label propagation (YD) – Similar to peer-pressure – Clusters vertices by popularity, with self-voting possible

Peer-pressure Clustering Algorithm assign each vertex to an initial cluster of its own do { (re-)assign each vertex to the cluster to which a plurality of its neighbors belong count the number of vertices that changed cluster in the prior step } while (enough vertices changed or other criteria)

Vertex-(re)assignment Query INSERT { GRAPH { ?s ?clus3 } } WHERE { { SELECT ?s (SAMPLE(?clus) AS ?clus3) WHERE { { SELECT ?s (MAX(?clusCt) AS ?maxClusCt) WHERE { SELECT ?s ?clus (COUNT(?clus) AS ?clusCt) WHERE { ?s ?o. GRAPH {?o ?clus} } GROUP ?s ?clus } GROUP BY ?s } { SELECT ?s ?clus (COUNT(?clus) AS ?clusCt) WHERE { ?s ?o. GRAPH {?o ?clus} } GROUP BY ?s ?clus } FILTER (?clusCt = ?maxClusCt) } GROUP BY ?s } For each vertex, find the cardinality of the most popular cluster to which its neighbors belong For each vertex, find the name and cardinality of each cluster to which its neighbors belong For each vertex, select the cluster which has cardinality equal to maximum cardinality

Lessons Pros This approach works Implementation is within the control of the developer, as it depends solely on standard SPARQL 1.1 and SPARQL Update functionality Performance can be very good Cons Implementation requires use of a second language for iteration (JavaScript and Python, in our cases); this complicates development, esp. debugging Computational inefficiencies – Executing the same selection code (to focus on the vertices/edges in question) in each query is redundant – INSERTing intermediate results is a heavier-weight operation than required

Lessons (2) Performance – In RDF graph, data to be clustered is typically a small fraction of the total data – Largest algorithm data 89M vertices / 1.6B edges – Iteration times s, depending on data size – Number of iterations varies by algorithm and data Need algorithms that cope with heterogeneous data

A Related Development The ability to call a (possibly iterative) graph function built into the SPARQL endpoint PREFIX yd: CONSTRUCT { # input graph for community detection algorithm consists of weighted edges ?vertex1 ?weight ?vertex2. } WHERE { ?vertex1 ?vertex2. # following statement weights intra-clique edges 10X more than inter-clique BIND (IF(SUBSTR(STR(?vertex1),11,11)=SUBSTR(STR(?vertex2),11,11), "10"^^xsd:integer, "1"^^xsd:integer) as ?weight) } INVOKE yd:graphAlgorithm.community (maxCommunitySize=200) PRODUCING (?vertex ?communityID) New feature in Urika Spring 2014 release Likely better performance, though less control by query developer

Summary Implementing iterative algorithms through sequences of SPARQL queries is a reasonable approach Delivers good performance and great query- developer flexibility Inherent performance issues may inhibit extreme performance Likely to be one of a variety of means to implement graph algorithms in SPARQL