Data Intensive Query Processing for Large RDF Graphs Using Cloud Computing Tools Mohammad Farhan Husain, Latifur Khan, Murat Kantarcioglu and Bhavani Thuraisingham.


Data Intensive Query Processing for Large RDF Graphs Using Cloud Computing Tools. Mohammad Farhan Husain, Latifur Khan, Murat Kantarcioglu and Bhavani Thuraisingham, Department of Computer Science, The University of Texas at Dallas. IEEE 2010 Cloud Computing. Presented May 11, 2011 by Taikyoung Kim, SNU IDB Lab.

Outline

- Introduction
- Proposed Architecture
- MapReduce Framework
- Results
- Conclusions and Future Work

Introduction

- With the explosion of semantic web technologies, the need to store and retrieve large amounts of data is common
  - The most prominent standards are RDF and SPARQL
- Current frameworks do not scale to large RDF graphs
  - E.g. Jena, Sesame, BigOWLIM
  - Designed for a single-machine scenario
  - Only 10 million triples can be processed in a Jena in-memory (2 GB) model

Introduction

- A distributed system can be built to overcome the scalability and performance problems of current Semantic Web frameworks
- However, there is no distributed repository for storing and managing RDF data
  - Distributed database systems and relational databases are available, but they have performance and scalability issues
  - It is possible to construct a distributed system from scratch
- A better way is to use a Cloud Computing framework or a generic distributed storage system
  - Just tailor it to meet the needs of semantic web data

Introduction

- Hadoop is an emerging Cloud Computing tool
  - Open source
  - High fault tolerance
  - Great reliability
  - MapReduce programming model
- We introduce a schema to store RDF data in Hadoop
- Our goal is to answer SPARQL queries as efficiently as possible, using summary statistics about the data
  - Choose the best plan based on a cost model
  - The plan determines the number of jobs as well as their sequence and inputs

Introduction

- Contributions
  1. Design a storage scheme to store RDF data in HDFS (the Hadoop Distributed File System)
  2. Devise an algorithm which determines the best query plan for a SPARQL query
  3. Build a cost model for query processing plans
  4. Demonstrate that our approach performs better than Jena

Proposed Architecture

- Data Generation and Storage
  - Use the LUBM dataset (a benchmark dataset)
    - Generates the RDF/XML serialization format
  - Convert the data to N-Triples for storage
    - One RDF triple per line of a file
- File Organization
  - Do not store all the data in a single file, since
    - A file is the smallest unit of input to a MapReduce job
    - A file is always read from disk (no cache)
  - Divide the data into multiple smaller files

(Figure: ontology fragment showing AdministrativeStaff rdfs:subClassOf Employee)
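Since each N-Triples line holds exactly one triple, a line can be parsed by whitespace splitting. The sketch below is simplified (not the paper's code): it assumes a line ends with " ." and ignores literals containing spaces and full N-Triples escaping.

```python
def parse_ntriple(line):
    """Split one N-Triples line into (subject, predicate, object).

    Simplified parser: assumes whitespace-separated terms terminated
    by ' .', with no embedded spaces inside literals.
    """
    s, p, o = line.rstrip(" .\n").split(None, 2)
    return s, p, o

triple = parse_ntriple("<John> <rdf:type> <Student> .\n")
# triple == ("<John>", "<rdf:type>", "<Student>")
```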

Proposed Architecture

- Predicate Split (PS)
  - Divide the data according to the predicates
    - Can cut down the search space if the query has no variable predicate
  - Name the files after the predicates
    - E.g. triples with predicate p1:pred go into a file named p1-pred

  Example: the triples
    John | rdf:type | Student
    James | rdf:type | Professor
    John | ub:advisor | James
    John | ub:takesCourse | DB
  are split into an "rdf-type" file (John | Student, James | Professor),
  a "ub-advisor" file (John | James), and a "ub-takesCourse" file (John | DB).
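The predicate split can be sketched in a few lines of Python, using in-memory lists as a stand-in for HDFS files (names and the `':' -> '-'` file-naming convention follow the example above; the code is illustrative, not the paper's implementation):

```python
from collections import defaultdict

def predicate_split(triples):
    """Partition (subject, predicate, object) tuples into per-predicate
    buckets, keeping only subject and object in each bucket.

    The bucket name is derived from the predicate, e.g. "rdf:type"
    maps to the file name "rdf-type".
    """
    files = defaultdict(list)
    for s, p, o in triples:
        files[p.replace(":", "-")].append((s, o))
    return dict(files)

triples = [
    ("John", "rdf:type", "Student"),
    ("James", "rdf:type", "Professor"),
    ("John", "ub:advisor", "James"),
    ("John", "ub:takesCourse", "DB"),
]
buckets = predicate_split(triples)
# buckets["rdf-type"] == [("John", "Student"), ("James", "Professor")]
```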

Proposed Architecture

- Predicate Object Split (POS)
  - Split using explicit type information of the object
    - Divide the rdf-type file into as many files as the number of distinct objects the rdf:type predicate has
  - Split using implicit type information of the object
    - Keep all literal objects intact
    - Move URI objects into their respective files, named predicate_type
    - The type information of a URI object can be retrieved from the rdf-type_* files

  Example: an "rdf-type" file containing John | Student, James | Professor,
  URI 1 | Professor splits into rdf-type_Student (John) and rdf-type_Professor
  (James, URI 1); since URI 1 is typed Professor, the pair John | URI 1 in the
  "ub-advisor" file moves to ub-advisor_Professor.
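The POS step can be sketched as a refinement over the predicate-split buckets (an illustrative in-memory sketch, not the paper's code; the `predicate_type` naming follows the slide):

```python
from collections import defaultdict

def predicate_object_split(ps_buckets):
    """Refine predicate-split buckets using object type information.

    - the rdf-type bucket splits into one file per distinct object
      (rdf-type_Student, rdf-type_Professor, ...), keeping subjects only;
    - in every other bucket, pairs whose object is a typed URI move to
      a file named predicate_type; literal objects stay where they are.
    """
    pos = defaultdict(list)
    type_of = {}                         # URI -> its rdf:type
    for s, o in ps_buckets.get("rdf-type", []):
        pos["rdf-type_" + o].append(s)
        type_of[s] = o
    for pred, pairs in ps_buckets.items():
        if pred == "rdf-type":
            continue
        for s, o in pairs:
            if o in type_of:             # URI object with known type
                pos[pred + "_" + type_of[o]].append((s, o))
            else:                        # literal: keep the original file
                pos[pred].append((s, o))
    return dict(pos)

ps_buckets = {
    "rdf-type": [("John", "Student"), ("James", "Professor")],
    "ub-advisor": [("John", "James")],
    "ub-takesCourse": [("John", "DB")],
}
pos = predicate_object_split(ps_buckets)
# pos["ub-advisor_Professor"] == [("John", "James")]
```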

Proposed Architecture

- Space benefits
- Special case: for a non-leaf type, search all the files for the leaf types of the subtree rooted at that type node
  - E.g. type-FullProfessor, type-AssociateProfessor, etc.

MapReduce Framework

- Challenges
  1. Determine the number of jobs needed to answer a query
  2. Minimize the size of intermediate files
  3. Determine the number of reducers
- Use the Map phase for selection and the Reduce phase for joins
- Queries often require more than one job
  - No inter-process communication
  - Each job may depend on the output of the previous job
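The selection-in-map, join-in-reduce split can be illustrated with a tiny in-memory stand-in for one MapReduce job (tuple shapes, tags, and data are illustrative, not the paper's implementation):

```python
from collections import defaultdict

def map_phase(records, key_index, tag):
    """Selection: emit (join-key, (tag, record)) pairs for one input file."""
    for rec in records:
        yield rec[key_index], (tag, rec)

def reduce_phase(pairs):
    """Join: group map output by key and combine records from both sides."""
    groups = defaultdict(lambda: defaultdict(list))
    for key, (tag, rec) in pairs:
        groups[key][tag].append(rec)
    joined = []
    for key, sides in groups.items():
        for left in sides.get("L", []):
            for right in sides.get("R", []):
                joined.append((key, left, right))
    return joined

# join [?X ub:advisor ?Y] with [?Y rdf:type Professor] on ?Y
advisor = [("John", "James")]        # (student, advisor) pairs
professors = [("James",)]            # subjects typed Professor
pairs = list(map_phase(advisor, 1, "L")) + list(map_phase(professors, 0, "R"))
result = reduce_phase(pairs)
# result == [("James", ("John", "James"), ("James",))]
```

A real job would run the map over many file splits in parallel and let Hadoop's shuffle deliver the key-grouped pairs to the reducers.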

MapReduce Framework: Input File Selection

(P: predicate, O: object)

- Select all files when
  - P is a variable, O is a variable, and O has no type information, or
  - O is concrete
- Select all predicate files having objects of that type when
  - P is a variable and O has type information
- Select all files for the predicate when
  - P is concrete, O is a variable, and O has no type information
- Select the predicate file having objects of that type when
  - The query has type information of the object
- Select all subclasses which are leaves in the subtree rooted at the type node when
  - The type associated with a predicate is not a leaf in the ontology tree
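A minimal sketch of these selection rules over POS file names (assumed to look like `pred_Type`); the `leaf_types` map standing in for the ontology subtree lookup is a hypothetical helper, not from the paper:

```python
def select_input_files(all_files, pattern, leaf_types=None):
    """Pick POS files for one triple pattern.

    pattern is (predicate, object_type); either may be None when it is
    a variable / has no type information. leaf_types maps a non-leaf
    type to the leaf types beneath it in the ontology tree.
    """
    pred, obj_type = pattern
    if obj_type and leaf_types and obj_type in leaf_types:
        wanted = set(leaf_types[obj_type])   # expand a non-leaf type
    elif obj_type:
        wanted = {obj_type}
    else:
        wanted = None                        # no type restriction
    selected = []
    for name in all_files:                   # names look like "pred_Type"
        fpred, _, ftype = name.partition("_")
        if pred is not None and fpred != pred:
            continue
        if wanted is not None and ftype not in wanted:
            continue
        selected.append(name)
    return selected

files = ["ub-advisor_Professor", "ub-advisor_Student", "rdf-type_Student"]
select_input_files(files, ("ub-advisor", "Professor"))
# → ["ub-advisor_Professor"]
```

With a non-leaf type, the lookup expands to leaf types, e.g. `leaf_types={"Professor": ["FullProfessor", "AssociateProfessor"]}` selects both `ub-advisor_FullProfessor` and `ub-advisor_AssociateProfessor`.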

MapReduce Framework: Cost Estimation for Query Processing

- Definition 1 (Conflicting Joins, CJ)
  - A pair of joins on different variables sharing a triple pattern
  - E.g. Join A (Line 1 & Line 3) and Join B (Line 3 & Line 4) conflict on Line 3
- Definition 2 (Non-Conflicting Joins, NCJ)
  - A pair of joins not sharing any triple pattern, or
  - A pair of joins sharing a triple pattern where the joins are on the same variable
  - E.g. Join 1 (Line 1 & Line 3) and Join 2 (Line 2 & Line 4) are non-conflicting
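The two definitions reduce to a small check if a join is modeled as its variable plus the set of triple patterns it covers (an illustrative encoding, not the paper's data structures):

```python
def classify_joins(join_a, join_b):
    """Classify a pair of joins as conflicting (CJ) or non-conflicting (NCJ).

    A join is (variable, set-of-triple-pattern-ids). Two joins conflict
    exactly when they share a triple pattern but join on different variables.
    """
    var_a, pats_a = join_a
    var_b, pats_b = join_b
    if (pats_a & pats_b) and var_a != var_b:
        return "CJ"
    return "NCJ"

# the slide's example: Join A on X over lines {1, 3}, Join B on Y over {3, 4}
classify_joins(("X", {1, 3}), ("Y", {3, 4}))   # → "CJ" (they share line 3)
classify_joins(("X", {1, 3}), ("Y", {2, 4}))   # → "NCJ"
```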

MapReduce Framework: Cost Estimation for Query Processing

(MI: cost of Map Input phase; MO: cost of Map Output phase; RI: cost of Reduce Input phase; RO: cost of Reduce Output phase)

- Map Input phase (MI)
  - Read the triple patterns from the selected input files
  - Cost equals the total number of triple patterns in each selected file
- Map Output phase (MO)
  - No bound variable case (e.g. [?X ub:worksFor ?Y])
    - MO cost = MI cost (all of the triple patterns are transformed into key-value pairs)
  - Bound variable case (e.g. [?Y ub:subOrganizationOf ])
    - Use summary statistics for selectivity
    - The cost is the result of bound-component selectivity estimation

MapReduce Framework: Cost Estimation for Query Processing

- Reduce Input phase (RI)
  - Read the map output via HTTP and then sort it by key values
  - RI cost = MO cost
- Reduce Output phase (RO)
  - Deals with performing the joins
  - Use the join triple-pattern selectivity summary statistics (no longer used)
  - For intermediate jobs, take an upper bound on the Reduce Output
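As a toy illustration, the four phase costs can be combined additively per job. This is a sketch under stated assumptions, not the paper's cost model: real selectivities and the RO upper bound come from the summary statistics, which are stood in for here by the `bound_selectivity` and `ro_bound` parameters.

```python
def job_cost(file_sizes, bound_selectivity=None, ro_bound=None):
    """Toy per-job cost following the MI/MO/RI/RO phases.

    file_sizes: triple counts of the selected input files.
    bound_selectivity: fraction surviving a bound component, if any.
    ro_bound: optional upper bound on the reduce output (intermediate jobs).
    """
    mi = sum(file_sizes)                 # triples read in the map phase
    if bound_selectivity is None:
        mo = mi                          # every triple becomes a key-value pair
    else:
        mo = int(mi * bound_selectivity) # bound-component selectivity estimate
    ri = mo                              # reduce reads what the map wrote
    ro = ri if ro_bound is None else min(ri, ro_bound)
    return mi + mo + ri + ro

job_cost([100, 50])                          # no bound variable: 150 per phase
job_cost([100, 50], bound_selectivity=0.1)   # bound variable thins MO/RI/RO
```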

MapReduce Framework: Query Plan Generation

- Need to determine the best query plan
  - Possible plans to answer a query have different performance (time & space)
- Plan Generation
  - Greedy approach
    - Simple
    - Generates a plan very quickly
    - No guarantee of the best plan
  - Exhaustive search approach (ours)
    - Generates all possible plans

MapReduce Framework: Query Plan Generation

- Plan Generation by Graph Coloring
  - Generate all combinations
  - For a job, select a subset of non-conflicting joins (NCJ)
    - Dynamically determine the number of jobs
  - Once a plan is generated, determine its cost using the cost model

(Figure: a Triple Pattern Graph over Lines 1-4 with nodes A-D labeled by join variables X and Y, and the corresponding Join Graph partitioned into job1 and job2)
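A minimal sketch of the enumeration step, again modeling a join as its variable plus the set of triple patterns it touches (illustrative encoding; full plan generation would recurse over the remaining joins after each candidate job and score every complete plan with the cost model):

```python
from itertools import combinations

def conflicting(a, b):
    """Joins conflict (CJ) when they share a triple pattern
    but join on different variables."""
    return bool(a[1] & b[1]) and a[0] != b[0]

def job_candidates(joins):
    """Enumerate every non-empty, pairwise non-conflicting subset of
    joins; each such subset is a legal set of joins for one job."""
    out = []
    for r in range(1, len(joins) + 1):
        for subset in combinations(joins, r):
            if all(not conflicting(a, b) for a, b in combinations(subset, 2)):
                out.append(subset)
    return out

# joins over patterns A-D: on X over {A, C}, on Y over {B, D}, on Y over {C, D}
joins = [("X", frozenset("AC")), ("Y", frozenset("BD")), ("Y", frozenset("CD"))]
cands = job_candidates(joins)
# the X-join over {A, C} and the Y-join over {C, D} conflict on pattern C,
# so they never appear in the same candidate job
```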

Results: Comparison with Other Frameworks

- Performance comparison between
  - Our framework
  - Jena In-Memory and SDB models
  - BigOWLIM
- System for testing Jena and BigOWLIM
  - 2.80 GHz quad-core processor
  - 8 GB main memory (BigOWLIM needed 7 GB for the billion-triples dataset)
  - 1 TB disk space
- Cluster of 10 nodes
  - Pentium IV 2.80 GHz processor
  - 4 GB main memory
  - 640 GB disk space

Results: Comparison with Other Frameworks

- The Jena In-Memory model worked well for small datasets
  - It became slower as the dataset size grew and eventually ran out of memory
- BigOWLIM has significantly higher loading time than ours
  - It builds indexes and pre-fetches triples to main memory
- The Hadoop cluster takes less than 1 minute to start up
  - Excluding loading time, ours is faster when there is no bound object

Results: Comparison with Other Frameworks

- As the size of the dataset grows, the time to answer a query does not grow proportionately

Results: Experiment with the Number of Reducers

- As the number of reducers increases, queries are answered faster
- The map outputs of queries 1, 12 and 13 are so small that they can be processed with one reducer

Conclusions and Future Work

- We proposed
  - A schema to store RDF data in plain text files
  - An algorithm to determine the best processing plan to answer a SPARQL query
  - A cost model to be used by the algorithm
- Our system is highly scalable
  - Query answer time does not increase as much as the data size grows
- Future work
  - Build a cloud service based on the framework
  - Investigate skewed distributions of the data
  - Experiment with heterogeneous clusters

Thank you. Questions?