National Institute of Advanced Industrial Science and Technology Query Processing for Distributed RDF Databases Using a Three-dimensional Hash Index Akiyoshi.

Presentation transcript:

National Institute of Advanced Industrial Science and Technology Query Processing for Distributed RDF Databases Using a Three-dimensional Hash Index Akiyoshi MATONO Grid Technology Research Center, AIST

Agenda: Motivation & Aims; Background (Distributed Hash Table (DHT)); Our approach; Performance evaluation; Summary.

Motivation
Describing resources in RDF is essential for semantic tasks such as resource discovery. RDF data is now widely used in many fields (e.g., bioinformatics and grid), so it is scattered everywhere and its total size is rapidly increasing. Providing efficient and scalable RDF query processing in a distributed environment is therefore an important issue. We propose P2P-based RDF query processing.

Aims
RDF data is scattered everywhere, so we aim to provide an efficient join operation in a distributed environment. The amount of data is rapidly increasing, so we aim to reduce the amount of data transferred among nodes and to achieve scalability, availability, and reliability.

Distributed Hash Table (DHT)
A DHT is a structured P2P network that achieves scalability, availability, and reliability. It supports only exact-match lookups for key-value pairs, through put(key, value) and get(key). Routing is performed in O(log n). Example protocols: Chord, Tapestry, Pastry, CAN, Kademlia.

Chord [Stoica01]
(Figure: a Chord ring with nodes N2, N6, N11, N17, N27, N33, N42, N48, N50, N56, and N63, together with the finger tables of N42, N11, and N27. Each finger table lists the successor node at distances +1, +2, +4, +8, ... from its owner.)
Example: put(28, A) is issued on N42. The figure follows the lookup through the finger tables of N42, N11, and N27 and marks N33 as the target node for key 28. The keys that fall in one area of the ring are stored on one node. The distance to the nodes in a finger table increases exponentially.
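
The lookup in the Chord example can be sketched in a few lines of Python. This is a minimal single-process simulation, not the Chord implementation itself: the node ids are read off the figure, while the 6-bit identifier space (a ring of 64 ids) and the helper names `successor`, `finger_table`, `between`, and `lookup` are assumptions made for illustration.

```python
from bisect import bisect_left

M = 6                     # assumed identifier length: 6 bits, ring of 64 ids
RING = 2 ** M
# Node ids read off the slide's ring figure.
NODES = sorted([2, 6, 11, 17, 27, 33, 42, 48, 50, 56, 63])


def successor(ident):
    """Return the first node at or after ident, wrapping around the ring."""
    i = bisect_left(NODES, ident % RING)
    return NODES[i % len(NODES)]


def finger_table(node):
    """Finger k points at successor(node + 2**k), so distances double."""
    return [successor(node + 2 ** k) for k in range(M)]


def between(x, a, b, inclusive_right=False):
    """True if x lies in the ring interval (a, b), or (a, b] if requested."""
    if inclusive_right and x == b:
        return True
    if a < b:
        return a < x < b
    return x > a or x < b


def lookup(start, key):
    """Return the list of nodes visited while routing key from start."""
    path = [start]
    node = start
    while True:
        succ = successor(node + 1)
        if between(key, node, succ, inclusive_right=True):
            path.append(succ)          # succ is responsible for key
            return path
        # Forward to the closest finger that precedes the key.
        node = next(f for f in reversed(finger_table(node))
                    if between(f, node, key))
        path.append(node)


print(lookup(42, 28))
```

Under these assumptions the request issued on N42 for key 28 visits N11 and N27 before reaching N33, which agrees with the target node marked in the figure.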

Our Approach
A three-dimensional hash space called “RDFCube”: each axis represents the hash space of one of subject, predicate, and object, and the space consists of a set of equally sized cubes called “cells”.
Bit information of RDFCube called the “existence flag”: each cell contains a bit that indicates the presence or absence of triples mapped into the cell.
The system runs on top of two DHTs: the RDFPeers DHT is used to store triples, and the RDFCube DHT is used to store the bit information.

RDFCube: three-dimensional hash space
Each axis represents the hash space of one of the triple’s elements (subject, predicate, and object). RDFCube is composed of a set of equally sized cubes called “cells”. A triple is mapped into RDFCube based on the hash values of its elements.
Example (figure): the triple (s, p, o) is hashed to the point (13, 54, 39), which is contained in the cell [0, 3, 2].
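
The mapping from a triple to a cell can be sketched as follows. The slides do not state the size of the hash space or the number of divisions; the values below (64 ids per axis, 4 cells per axis) are assumptions chosen because they reproduce the examples in the deck, and SHA-1 is only an illustrative hash function. The names `hash_element`, `point_of`, and `cell_of` are hypothetical.

```python
import hashlib

HASH_SPACE = 64   # assumed size of the hash space on each axis
DIVISIONS = 4     # assumed number of cells per axis
CELL_WIDTH = HASH_SPACE // DIVISIONS


def hash_element(term):
    """Hash one RDF term onto an axis (SHA-1 is an illustrative choice)."""
    digest = hashlib.sha1(term.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % HASH_SPACE


def point_of(triple):
    """Map (subject, predicate, object) to a point in the 3-D hash space."""
    return tuple(hash_element(term) for term in triple)


def cell_of(point):
    """Return the index of the cell that contains the point."""
    return tuple(coord // CELL_WIDTH for coord in point)


print(cell_of((13, 54, 39)))
```

With a cell width of 16, the point (13, 54, 39) falls in the cell [0, 3, 2], as on the slide.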

Existence Flag
Each cell contains a bit, the existence flag, that indicates the presence or absence of triples mapped into the cell.
(Figure: the cell [0, 3, 2] and its existence flag; the cell sequence [0, 1, *] and its bit sequence; the cell matrix [0, *, *] and its bit matrix.)
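
A small sketch may help relate cells, cell sequences, and cell matrices. It assumes 4 divisions per axis and represents the existence flags as a set of occupied cells; the row and column order of the bit matrix (rows by predicate, columns by object) is an assumption, as are the function names.

```python
DIVISIONS = 4  # assumed number of cells per axis


def bit_sequence(occupied, s, p):
    """Existence flags of the cell sequence [s, p, *], one bit per object cell."""
    return [1 if (s, p, o) in occupied else 0 for o in range(DIVISIONS)]


def bit_matrix(occupied, s):
    """Existence flags of the cell matrix [s, *, *]: rows by predicate cell."""
    return [bit_sequence(occupied, s, p) for p in range(DIVISIONS)]


occupied = {(0, 3, 2)}          # one triple has been mapped into cell [0, 3, 2]
print(bit_sequence(occupied, 0, 3))
for row in bit_matrix(occupied, 0):
    print(row)
```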

Two DHTs: RDFCube & RDFPeers
The RDFPeers DHT is used to store RDF triples. RDFPeers is an RDF repository utilizing a DHT, proposed by [Min Cai and Martin Frank, 2004].
The RDFCube DHT is used to store bit information and serves as an index for RDFPeers.
Storing triples:
1. Store the triples into the RDFPeers DHT.
2. Store the bit information of the triples into the RDFCube DHT.
Query processing with join operation:
1. Get the bit information from the RDFCube DHT.
2. Perform AND operations on the bits.
3. Get triples from the RDFPeers DHT based on the bit information.

RDFPeers [Cai04]
An RDF repository utilizing a DHT. We call the DHT used by RDFPeers the RDFPeers DHT. Key: each of subject, predicate, and object. Value: the triple.
To store a triple into the RDFPeers DHT, the triple is stored three times on three nodes by three lookups that use the triple’s elements as keys: put(s, (s, p, o)), put(p, (s, p, o)), and put(o, (s, p, o)).
(Figure: the RDFPeers DHT ring with nodes N4, N8, N21, N25, N41, N55, and N63; the triple is stored by subject, by predicate, and by object.)

RDFPeers [Cai04]
Given a query triple, perform a lookup using one of its constants as the key. For a query triple with the constants s and p and a variable object, either get(s) or get(p) can be used.
(Figure: the RDFPeers DHT ring with nodes N4, N8, N21, N25, N41, N55, and N63; the two possible lookups are routed to N55 and N21.)
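
The put and get behavior described for the RDFPeers DHT can be sketched with an in-memory dictionary standing in for the DHT. This is not the RDFPeers code; the class `RDFPeersSketch`, its methods, and the convention of writing variables as `None` are assumptions for illustration, and each dictionary access stands for one DHT lookup.

```python
from collections import defaultdict


class RDFPeersSketch:
    """In-memory stand-in for the RDFPeers DHT (one dict instead of a ring)."""

    def __init__(self):
        self.table = defaultdict(set)
        self.lookups = 0               # counts simulated DHT lookups

    def put(self, key, triple):
        self.lookups += 1
        self.table[key].add(triple)

    def get(self, key):
        self.lookups += 1
        return set(self.table.get(key, set()))

    def store(self, triple):
        """Store the triple three times, keyed by subject, predicate, object."""
        for element in triple:
            self.put(element, triple)

    def match(self, pattern):
        """Answer a query triple; None marks a variable (needs one constant)."""
        constants = [c for c in pattern if c is not None]
        candidates = self.get(constants[0])     # one lookup by one constant
        return {t for t in candidates
                if all(q is None or q == v for q, v in zip(pattern, t))}


peers = RDFPeersSketch()
peers.store(("s1", "p1", "o1"))
peers.store(("s2", "p1", "o2"))
print(peers.match(("s1", "p1", None)))
```

Storing costs three lookups per triple, while answering a single query triple costs one lookup, because any one constant is enough to reach a node that holds all matching triples.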

RDFCube DHT
Key: ID of a cell matrix. Value: bit matrix.
To set the bit of a cell to 1 in the RDFCube DHT, perform three lookups using the three cell matrices that contain the cell as keys. For example, for the cell [1, 2, 1], put is called with the keys [1, *, *], [*, 2, *], and [*, *, 1].
(Figure: the RDFCube DHT ring with nodes N1, N15, N28, N36, N51, and N57, which store the bit matrices for [1, *, *], [*, 2, *], and [*, *, 1].)

RDFCube DHT
Key: ID of a cell matrix. Value: bit matrix.
To get the bit matrix of a cell matrix, perform a lookup using the cell matrix ID as the key, e.g., get([1, *, *]).
(Figure: the RDFCube DHT ring with nodes N1, N15, N28, N36, N51, and N57; the lookup for [1, *, *] is answered by N57.)
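
The put and get operations on the RDFCube DHT can be sketched in the same style, again with a dictionary standing in for the DHT. The class `RDFCubeSketch`, the helper `matrix_keys`, the use of `"*"` in tuple keys, the 4 divisions per axis, and the row and column order inside each bit matrix are all assumptions for illustration.

```python
DIVISIONS = 4   # assumed number of cells per axis
STAR = "*"


def matrix_keys(cell):
    """IDs of the three cell matrices that contain the given cell."""
    s, p, o = cell
    return [(s, STAR, STAR), (STAR, p, STAR), (STAR, STAR, o)]


class RDFCubeSketch:
    """In-memory stand-in for the RDFCube DHT: cell matrix ID -> bit matrix."""

    def __init__(self):
        self.table = {}

    @staticmethod
    def _empty():
        return [[0] * DIVISIONS for _ in range(DIVISIONS)]

    def set_bit(self, cell):
        """Set the cell's bit in each of the three bit matrices (3 lookups)."""
        s, p, o = cell
        # Inside each matrix the bit is addressed by the two free axes.
        positions = [(p, o), (s, o), (s, p)]
        for key, (row, col) in zip(matrix_keys(cell), positions):
            matrix = self.table.setdefault(key, self._empty())
            matrix[row][col] = 1

    def get(self, key):
        """Return the bit matrix stored under a cell matrix ID (1 lookup)."""
        return self.table.get(key, self._empty())


cube = RDFCubeSketch()
cube.set_bit((1, 2, 1))
print(sorted(cube.table, key=str))
print(cube.get((1, STAR, STAR)))
```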

Storing Triples
Given a triple (s, p, o):
1. Update the RDFPeers DHT: store the triple into the RDFPeers DHT by three lookups.
2. Update the RDFCube DHT: get the cell into which the triple is mapped, then set the corresponding bit in each of the three bit matrices to 1 by three lookups.
Example (figure): the triple is hashed to the point (21, 45, 17), which lies in the cell [1, 2, 1]; its bit is set under the keys [1, *, *], [*, 2, *], and [*, *, 1].

Query Processing (1/2)
Given the query (figure: two query triples that share the variable ?x, (?x, p1, A1) and (?x, p2, B2)):
1. Get the bit information of the cells into which the query triples are mapped (in the figure, the cell sequences [*, 3, 2] and [*, 1, 1]).
2. Perform an AND operation between the bits (the figure shows the bit sequence 0010).
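
The AND step can be sketched as a bitwise AND over the two bit sequences along the subject axis. The input sequences below are hypothetical values chosen so that the result is 0010, the sequence shown on the slide; the function name `and_bits` is also an assumption.

```python
def and_bits(*sequences):
    """Bitwise AND of equally long bit sequences."""
    return [int(all(bits)) for bits in zip(*sequences)]


# Hypothetical bit sequences for the cell sequences [*, 3, 2] and [*, 1, 1].
bits_p1 = [0, 1, 1, 0]
bits_p2 = [0, 0, 1, 1]

joined = and_bits(bits_p1, bits_p2)
surviving_cells = [i for i, bit in enumerate(joined) if bit]
print(joined, surviving_cells)
```

A bit survives only if both query triples have at least one matching triple whose subject hashes into that cell, so subjects in the other cells cannot be join answers.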

Query Processing (2/2)
3. Get triples from the RDFPeers DHT based on the bit information:
   1. Access the remote node where the candidate answer triples are stored.
   2. For each triple, check whether the bit of the cell into which the triple is mapped is equal to 1.
Example (figure): a lookup by predicate p1 for the query triple (?x, p1, A1) on the RDFPeers DHT; the candidate answers are (s0, p1, A0), (s1, p1, A1), (s2, p1, A1), and (s3, p1, A2).

Query Processing (2/2)
3. Get triples from the RDFPeers DHT based on the bit information:
   1. Access the remote node where the candidate answer triples are stored.
   2. For each triple, check whether the bit of the cell into which the triple is mapped is equal to 1.
   3. Return the candidate answer triples that satisfy the condition from the remote node.
Filtering based on the bit information narrows down the number of candidate answers.
(Figure: the candidate answers (s0, p1, A0), (s1, p1, A1), (s2, p1, A1), and (s3, p1, A2) for the query triple (?x, p1, A1) are checked against the cells [0, 3, 2], [1, 3, 2], [2, 3, 2], and [3, 3, 2]; the triples that fail the check are marked “Not Answers”.)
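
The filtering performed on the remote node can be sketched as follows. The candidate triples are the ones named on the slide, but the subject hash values, the cell width of 16, and the AND result used here are hypothetical, and checking the query triple's constants alongside the bit is an assumption about how the remote node selects answers.

```python
CELL_WIDTH = 16   # assumed width of a cell on each axis

# Hypothetical hash values that place s0..s3 in subject cells 0..3.
SUBJECT_HASH = {"s0": 5, "s1": 20, "s2": 40, "s3": 60}


def filter_candidates(candidates, pattern, subject_bits):
    """Keep triples that match the pattern and whose subject cell bit is 1."""
    kept = []
    for triple in candidates:
        matches = all(q is None or q == v for q, v in zip(pattern, triple))
        cell = SUBJECT_HASH[triple[0]] // CELL_WIDTH
        if matches and subject_bits[cell] == 1:
            kept.append(triple)
    return kept


candidates = [("s0", "p1", "A0"), ("s1", "p1", "A1"),
              ("s2", "p1", "A1"), ("s3", "p1", "A2")]
print(filter_candidates(candidates, (None, "p1", "A1"), [0, 0, 1, 0]))
```

Only the triples that pass the check are sent back, which is where the reduction in transferred data comes from.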

Performance Evaluation
We compare RDFPeers alone (PEERS) with RDFPeers + RDFCube (CUBE).
Data set: XML documents of DBLP transformed into RDF data; 4 RDF data sets with different numbers of triples (12500, 25000, 50000, [value missing]).
Environment: an emulated 100-node Chord network; the #divisions of RDFCube is [value missing].
Queries (figure):
Query 1: ?x with type Article, author “Jim Gray”, year “1998”, and journal “CoRR”.
Query 2: the variables ?x and ?y, the constant “LNCS”, and the predicates title and series.
Query 3: the variables ?x, ?y, and ?z, the constant “VLDB2004”, and the predicates title, crossref, and title.

Storing Performance
PEERS is the network cost of storing triples. CUBE is the network cost of storing triples plus index construction. The results are read as the ratio of CUBE to PEERS: a ratio of 2 means that index construction costs as much as storing the triples, and a ratio of 1 means that index construction costs nothing.
The ratio of #hops is smaller than 2, so the cost of index construction is smaller than that of storing triples. The ratio of transferred data size is very close to 1, so the amount of data transferred for index construction is very small.
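
The reading of the ratio can be written out explicitly. The symbols below are introduced here for illustration and do not appear on the slides: C_store is the cost of storing triples (PEERS) and C_index is the additional cost of index construction, so that CUBE costs C_store + C_index.

```latex
r \;=\; \frac{C_{\mathrm{CUBE}}}{C_{\mathrm{PEERS}}}
  \;=\; \frac{C_{\mathrm{store}} + C_{\mathrm{index}}}{C_{\mathrm{store}}}
  \;=\; 1 + \frac{C_{\mathrm{index}}}{C_{\mathrm{store}}},
\qquad
r = 2 \iff C_{\mathrm{index}} = C_{\mathrm{store}},
\quad
r = 1 \iff C_{\mathrm{index}} = 0,
\quad
r < 2 \iff C_{\mathrm{index}} < C_{\mathrm{store}}.
```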

Retrieval Performance
PEERS is the network cost of getting triples from the RDFPeers DHT. CUBE is the network cost of getting bits and triples from the two DHTs.
#hops on CUBE is twice that on PEERS, because #hops to get triples is equal to #hops to get bit information. The transferred data size is reduced to at most 1/50 in query 1. Our approach makes it possible to reduce the transfer size, in particular when the same variable occurs many times in the query.

Scalability
The ratio of CUBE to PEERS stays constant in all queries. Our approach achieves scalability with respect to the number of triples.

Summary
What we have achieved: scalability with respect to #triples, and a reduction in the amount of data transferred among nodes.
Our major current challenge: providing efficient RDF query processing with join operations in a distributed environment.
What we will achieve in the near future: eliminating the redistribution of triples, utilizing the schema information, and a dynamic division mechanism for RDFCube.

Thank You