Distributed reasoning: because size matters
Tutorial at WWW 2011
Andreas Harth, Aidan Hogan, Spyros Kotoulas, Jacopo Urbani

2 Outline
Session 1: Introduction to Linked Data
- Foundations and Architectures
- Crawling and Indexing
- Querying
Session 2: Integrating Web Data with Reasoning
- Introduction to RDFS/OWL on the Web
- Introduction and Motivation for Reasoning
Session 3: Distributed Reasoning: Because Size Matters
- Problems and Challenges
- MapReduce and WebPIE
Session 4: Putting Things Together (Demo)
- The LarKC Platform
- Implementing a LarKC Workflow

3 The Semantic Web growth
Exponential growth of RDF:
- 2007: 0.5 Billion triples
- 2008: 2 Billion triples
- 2009: 6.7 Billion triples
- 2010: 26.9 Billion triples
- Now: ??
(Thanks to Chris Bizer for providing these numbers)

4 2008

5 2009

6 2010

7 PROBLEMS AND CHALLENGES

8 Problems and challenges
One machine is not enough to store and process the Web; we must distribute data and computation. What architecture?
Several architectures of supercomputers:
- SIMD (single instruction/multiple data) processors, like graphics cards
- Multiprocessing computers (many CPUs, shared memory)
- Clusters (shared-nothing architecture)
Algorithms depend on the architecture. Clusters are becoming the reference architecture for High Performance Computing.

9 Problems and challenges
In a distributed environment the increase in performance comes at the price of new problems that we must face:
- Load balancing
- High I/O cost
- Programming complexity

10 Problems and challenges: load balancing
Cause: in many cases (like reasoning) some data is needed much more often than other data (e.g. schema triples).
Effect: some nodes must work harder to serve the others, and this hurts scalability.

11 Problems and challenges: high I/O cost
Cause: data is distributed over several nodes, and during reasoning the peers need to exchange it heavily.
Effect: hard-drive or network speed becomes the performance bottleneck.

12 Problems and challenges: programming complexity
Cause: in a parallel setting there are many technical issues to handle:
- Fault tolerance
- Data communication
- Execution control
- Etc.
Effect: programmers need to write much more code in order to execute an application on a distributed architecture.

13 CURRENT STATE OF THE ART

14 Current work on high performance reasoning
OpenLink Virtuoso
- RDF store developed by OpenLink
- Supports backward reasoning using OWL logic
4Store
- RDF store developed by Garlik
- Performs backward RDFS reasoning
- Works on clusters (up to 32 nodes)
OWLIM
- Developed by Ontotext
- Supports reasoning up to OWL 2 RL
- Works on a single machine
BigData
- Developed by SYSTAP
- Performs RDFS+ and custom rules

15 Current work on high performance reasoning
MaRVIN [Kotoulas et al. 2009]
- First distributed approach; reasoning over a P2P network
- Each node reasons over local data and exchanges the derivations
RDFS closure on the Blue Gene [Weaver et al. 2009]
- Replicates schema data on all nodes
- Derivation performed locally (no information exchange)
WebPIE [Urbani et al. 2010]
- Reasoning performed with MapReduce
- Currently the most scalable approach (reasoning over 100B triples)

17 MAPREDUCE

18 MapReduce
Analytical tasks over very large data (logs, web) always follow the same pattern:
- Iterate over a large number of records
- Extract something interesting from each
- Shuffle and sort intermediate results
- Aggregate intermediate results
- Generate final output
Idea: provide a functional abstraction for two of these steps, the per-record extraction (map) and the aggregation (reduce).

19 MapReduce
In 2004 Google introduced the idea of MapReduce: computation is expressed only with map and reduce functions. Hadoop is a very popular open-source MapReduce implementation.
A MapReduce framework provides:
- Automatic parallelization and distribution
- Fault tolerance
- I/O scheduling
- Monitoring and status updates
Users write MapReduce programs -> the framework executes them.

20 MapReduce
A MapReduce program is a sequence of one (or more) pairs of map and reduce functions. All the information is expressed as a set of key/value pairs.
The execution of a MapReduce program is as follows:
1. the map function transforms input records into intermediate key/value pairs
2. the MapReduce framework automatically groups the pairs by key
3. the reduce function processes each group and returns the output
Example: suppose we want to calculate the occurrences of words in a set of documents.
map(null, file) {
  for (word in file)
    output(word, 1)
}
reduce(word, set<int> numbers) {
  int count = 0;
  for (int value : numbers)
    count += value;
  output(word, count)
}
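For reference, the same word count written against the Hadoop Java API would look roughly like the sketch below; the class and job names are only illustrative, and a standard Hadoop 2.x+ installation on the classpath is assumed.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // map: emit (word, 1) for every word in the input split
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // reduce: the framework has grouped the pairs by word; sum the counts
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // optional local aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}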

21 MapReduce
"How can MapReduce help us solve the three problems above?"
High communication cost: the map functions are executed on local data, which reduces the volume of data that nodes need to exchange.
Programming complexity: in MapReduce the user only needs to write the map and reduce functions; the framework takes care of everything else.
Load balancing: this problem is still not solved. Further research is necessary…

22 WEBPIE

23 WebPIE
WebPIE is a forward reasoner that uses MapReduce to execute the reasoning rules. All code, documentation, tutorials etc. are available online.
WebPIE algorithm:
Input: triples in N-Triples format
1) Compress the data with dictionary encoding
2) Launch reasoning
3) Decompress the derived triples
Output: triples in N-Triples format
(1st step: compression, 2nd step: reasoning)

24 WebPIE 1st step: compression
Compressing the data is necessary to improve the performance. In WebPIE we compress the data using dictionary encoding.
Why dictionary encoding and not simply zip?
- Zipped data becomes inaccessible to the application
- With dictionary encoding, applications can still manage the data
Why MapReduce for compression?
- The data is too large for one machine
- The dictionary table is too large to fit in memory
Dictionary encoding with MapReduce is challenging!
- Load balancing due to high data skew
- Centralized dictionary encoding is a bottleneck in a distributed system
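To make the idea concrete, here is a minimal single-machine sketch of dictionary encoding, not WebPIE's MapReduce implementation: every term is mapped to a numeric ID, a triple becomes three numbers, and the dictionary allows decoding back to the original terms. Class name and example terms are illustrative.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DictionaryEncoder {
  private final Map<String, Long> dictionary = new HashMap<>();
  private final List<String> reverse = new ArrayList<>();

  // Assign a fresh ID the first time a term is seen, reuse it afterwards.
  public long encodeTerm(String term) {
    return dictionary.computeIfAbsent(term, t -> {
      reverse.add(t);
      return (long) (reverse.size() - 1);
    });
  }

  public long[] encodeTriple(String s, String p, String o) {
    return new long[] { encodeTerm(s), encodeTerm(p), encodeTerm(o) };
  }

  public String decodeTerm(long id) {
    return reverse.get((int) id);
  }

  public static void main(String[] args) {
    DictionaryEncoder enc = new DictionaryEncoder();
    long[] t = enc.encodeTriple(
        "<http://example.org/alice>",
        "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>",
        "<http://example.org/Person>");
    System.out.println(t[0] + " " + t[1] + " " + t[2]); // e.g. 0 1 2
    System.out.println(enc.decodeTerm(t[2]));           // the original term
  }
}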

25 WebPIE: compression
In WebPIE we solved the load balancing problem by processing popular terms in the map and the other terms in the reduce. Also, the centralized dictionary is replaced by partitioning the number space and assigning IDs in parallel.
Ok, but how does it work?

26 WebPIE 1st step: compression
Compression algorithm: a sequence of 3 MapReduce jobs
- 1st job: identify popular terms and assign them a number
- 2nd job: deconstruct the statements; replace popular terms in the map and non-popular terms in the reduce
- 3rd job: reconstruct the statements in compressed format
Decompression algorithm: a sequence of 4 MapReduce jobs
- 1st job: identify popular terms
- 2nd job: join the popular terms with the dictionary
- 3rd job: deconstruct the statements and replace popular terms in the map and non-popular terms in the reduce
- 4th job: reconstruct the statements in N-Triples format
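A small sketch of one simple way to realize the "partitioning numbers and assigning in parallel" idea mentioned on the previous slide (the class is illustrative and the actual WebPIE jobs are more involved): each task owns a disjoint slice of the number space, so IDs can be handed out with no coordination and no centralized dictionary.

public class PartitionedIdAssigner {
  private final long taskId;     // 0 .. numTasks-1, e.g. the reduce task's number
  private final long numTasks;
  private long localCounter = 0;

  public PartitionedIdAssigner(long taskId, long numTasks) {
    this.taskId = taskId;
    this.numTasks = numTasks;
  }

  // IDs generated by different tasks can never collide:
  // task 0 gives 0, numTasks, 2*numTasks, ...; task 1 gives 1, numTasks+1, ...
  public long nextId() {
    return taskId + numTasks * localCounter++;
  }

  public static void main(String[] args) {
    PartitionedIdAssigner task0 = new PartitionedIdAssigner(0, 4);
    PartitionedIdAssigner task1 = new PartitionedIdAssigner(1, 4);
    System.out.println(task0.nextId() + " " + task0.nextId()); // 0 4
    System.out.println(task1.nextId() + " " + task1.nextId()); // 1 5
  }
}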

27 WebPIE 2nd step: reasoning
Reasoning means applying a set of rules to the entire input until no new derivation is possible. The difficulty of reasoning depends on the logic considered.
RDFS reasoning
- Set of 13 rules
- All rules require at most one join between a "schema" triple and an "instance" triple
OWL reasoning
- Logic more complex => rules more difficult
- The ter Horst fragment provides a set of 23 new rules
- Some rules require a join between instance triples
- Some rules require multiple joins
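In code, "applying the rules until no new derivation is possible" amounts to a fixpoint loop like the minimal single-machine sketch below; triples are represented as plain strings and applyRules() is only a placeholder for the actual rule implementations discussed on the next slides.

import java.util.HashSet;
import java.util.Set;

public class Closure {

  // Placeholder: evaluate every rule against the current triples and return the conclusions.
  static Set<String> applyRules(Set<String> triples) {
    Set<String> derived = new HashSet<>();
    // ... one pass over all 13 (RDFS) or 23 (ter Horst) rules ...
    return derived;
  }

  // Keep applying the rules until a full pass adds nothing new.
  static Set<String> closure(Set<String> input) {
    Set<String> closure = new HashSet<>(input);
    while (closure.addAll(applyRules(closure))) {
      // addAll returns true only if new triples were derived, so the loop stops at the fixpoint
    }
    return closure;
  }
}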


29 WebPIE 2nd step: RDFS reasoning
Q: How can we apply a reasoning rule with MapReduce?
A: During the map we write the matching point of the rule into the intermediate key, and in the reduce we derive the new triples.
Example: if a rdf:type B and B rdfs:subClassOf C, then a rdf:type C
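A sketch of this rule as a plain reduce-side join in Hadoop, only to illustrate the idea on this slide (as the next slide explains, WebPIE does not implement the rules in this naive way). Triples are assumed to arrive one per line as "s p o" with prefixed names; the class B is used as the intermediate key, exactly as described above.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class NaiveRdfs9 {

  public static class JoinMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    public void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String[] t = value.toString().split(" ");              // t = {subject, predicate, object}
      if (t[1].equals("rdf:type")) {
        ctx.write(new Text(t[2]), new Text("type\t" + t[0])); // key = B, value = instance a
      } else if (t[1].equals("rdfs:subClassOf")) {
        ctx.write(new Text(t[0]), new Text("sub\t" + t[2]));  // key = B, value = superclass C
      }
    }
  }

  public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text classB, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      List<String> instances = new ArrayList<>();
      List<String> superClasses = new ArrayList<>();
      for (Text v : values) {
        String[] parts = v.toString().split("\t");
        if (parts[0].equals("type")) instances.add(parts[1]);
        else superClasses.add(parts[1]);
      }
      // every (a, C) pair gives a derived triple "a rdf:type C"
      for (String a : instances)
        for (String c : superClasses)
          ctx.write(new Text(a), new Text("rdf:type " + c));
    }
  }
}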

30 WebPIE 2nd step: RDFS reasoning
However, such a straightforward approach does not work, for several reasons:
- Load balancing
- Duplicate derivations
- Etc.
In WebPIE we applied three main optimizations to apply the RDFS rules:
1. We apply the rules in a specific order to avoid loops
2. We execute the joins by replicating and loading the schema triples in memory
3. We perform the joins in the reduce function and use the map function to generate fewer duplicates

31 WebPIE 2nd step: RDFS reasoning
1st optimization: apply the rules in a specific order

32 WebPIE 2nd step: RDFS reasoning
2nd optimization: perform the join during the map
- The schema is small enough to fit in memory
- Each node loads it in memory
- The instance triples are read as MapReduce input and the join is done against the in-memory set
3rd optimization: avoid duplicates with special grouping
- The join can be performed either in the map or in the reduce
- If we do it in the reduce, then we can group the triples so that the key is equal to the part of the derivation that is input dependent
- Groups cannot generate the same derived triple => no duplicates
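A sketch of the in-memory schema join from the 2nd optimization. It is illustrative only: the mapper below hard-codes a tiny rdfs:subClassOf schema in setup(), whereas a real job would load the replicated schema from a shared location (e.g. the distributed cache); triples again arrive one per line as "s p o".

import java.io.IOException;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class InMemorySchemaJoinMapper
    extends Mapper<LongWritable, Text, Text, NullWritable> {

  // class -> its superclasses, replicated in memory on every node
  private final Map<String, Set<String>> subClassOf = new HashMap<>();

  @Override
  protected void setup(Context ctx) {
    // In a real job the schema triples would be read here from a shared file;
    // for the sketch we hard-code a tiny schema.
    subClassOf.computeIfAbsent("ex:Student", k -> new HashSet<>()).add("ex:Person");
  }

  @Override
  public void map(LongWritable key, Text value, Context ctx)
      throws IOException, InterruptedException {
    String[] t = value.toString().split(" ");
    if (t[1].equals("rdf:type") && subClassOf.containsKey(t[2])) {
      for (String superClass : subClassOf.get(t[2])) {
        // derived triple: instance rdf:type superclass
        ctx.write(new Text(t[0] + " rdf:type " + superClass), NullWritable.get());
      }
    }
  }
}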


34 WebPIE 2nd step: OWL reasoning
Since the RDFS optimizations are not enough, we introduced new optimizations to deal with the more complex rules. We will not explain all of them, but only one.
Example: the transitivity rule
if p rdf:type owl:TransitiveProperty and a p b and b p c, then a p c
This rule is problematic because:
- we need to perform a join between instance triples
- every time, we also re-derive what we derived before
Solution: we perform the join in the "naïve" way, but we only consider triples in a "specific" position.

35 WebPIE 2nd step: OWL reasoning
Example, 1st M/R job — input: triples annotated with their distance, {…, 1}, {…, 2}; output: the triples derived by the first join.

36 WebPIE 2nd step: OWL reasoning
Example, 2nd M/R job — input: the original triples {…, 1}, {…, 2}, … together with the triples derived after job 1, {…, 3}, {…, 4}; output: the triples derived in this round.

37 WebPIE 2nd step: OWL reasoning
By accepting only triples with a specific distance we avoid deriving information that was already derived.
General rule: every job accepts as input only triples derived in the previous two steps. During the execution of the nth job we derive only if:
- the antecedent triples on the left side have distance 2^(n-1) or 2^(n-2)
- the antecedent triples on the right side have distance greater than 2^(n-2)
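The acceptance test itself is tiny; below is a sketch of it, assuming (as in the example above) that every triple carries an integer distance. Treating the 2^(n-2) threshold as 0 for the first job is an extra assumption, since the slide only states the general case.

public class DistanceFilter {
  // Returns true if a pair of antecedent triples may be joined during job n (n starts at 1).
  static boolean joinAllowed(int n, int leftDistance, int rightDistance) {
    int twoPowNMinus1 = 1 << (n - 1);                  // 2^(n-1)
    int twoPowNMinus2 = (n >= 2) ? (1 << (n - 2)) : 0; // 2^(n-2), taken as 0 for the first job
    boolean leftOk = leftDistance == twoPowNMinus1 || leftDistance == twoPowNMinus2;
    boolean rightOk = rightDistance > twoPowNMinus2;
    return leftOk && rightOk;
  }
}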

38 WebPIE: performance
We tested the performance on LUBM, LDSR and Uniprot. Tests were conducted on the DAS-3 cluster. Performance depends not only on the input size but also on the complexity of the input.
Execution time using 32 nodes:
Dataset  | Input       | Output      | Exec. time
LUBM     | 1 Billion   | 0.5 Billion | 1 Hour
LDSR     | 0.9 Billion |             | 3.5 Hours
Uniprot  | 1.5 Billion | 2 Billion   | 6 Hours

39 WebPIE: performance
Input complexity (charts: reasoning on LUBM, reasoning on LDSR)

40 WebPIE: performance
Scalability on the input size (LUBM, up to 100 Billion triples)

41 WebPIE: performance
Scalability on the number of nodes (up to 64 nodes)

42 WebPIE: performance

43 WebPIE: performance We are here!!

44 Conclusions
With WebPIE we show that high-performance reasoning over very large data is possible, but we need to compromise between reasoning complexity and performance.
Many problems are still unresolved:
- How do we collect the data?
- How do we query large data?
- How do we introduce a form of authoritative reasoning to prevent "bad" derivations?
- Etc.

45 DEMO