Very VERY large scale knowledge representation


Very VERY large scale knowledge representation. Frank van Harmelen, in collaboration with Jacopo Urbani (VUA) and Henri Bal (VUA).

Reminder: RDF(S) & OWL are (weak) logics. We can infer implicit statements from explicit ones: subClass hierarchy, predicate types, cardinalities, transitive & symmetric relations, inconsistency, ... Can we do this at “Web scale”?
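
To make "infer implicit statements from explicit ones" concrete, here is a minimal single-machine sketch. It assumes only two RDFS rules (rdfs9 and rdfs11), uses plain string names in place of URIs, and adds a hypothetical `Mammal` class that is not on the slides. It is an illustration of the idea, not the method used in the talk.

```python
def closure(triples):
    """Naive fixpoint: apply two RDFS rules until nothing new is derived."""
    facts = set(triples)
    while True:
        new = set()
        for (a, p, b) in facts:
            for (c, q, d) in facts:
                if b != c:
                    continue  # the two triples share no joining term
                # rdfs11: subClassOf is transitive
                if p == "subClassOf" and q == "subClassOf":
                    new.add((a, "subClassOf", d))
                # rdfs9: an instance of a subclass is an instance of the superclass
                if p == "type" and q == "subClassOf":
                    new.add((a, "type", d))
        if new <= facts:
            return facts
        facts |= new


# Hypothetical input; "Mammal" is an assumption added for illustration.
explicit = {
    ("Socrates", "type", "Human"),
    ("Human", "subClassOf", "Mammal"),
    ("Mammal", "subClassOf", "Animal"),
}
inferred = closure(explicit) - explicit
```

The nested loop compares every pair of triples on every pass, which is exactly what stops working at Web scale.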

25 billion facts and counting

1 triple:

10^7 Triples: Suez Canal, 163 km ≈ 10^5 m. Denny Vrandečić – AIFB, Universität Karlsruhe (TH), http://www.aifb.uni-karlsruhe.de/WBS

10^8 Triples: Moon. RDF store, subsecond querying.

~10^9 Triples: Earth. Diameter: 10^5 km. Volume: 10^21 km³.

~10^10 Triples ≈ 1 triple per web-page: Jupiter [LarKC proposal]

~10^11 Triples

Rephrase inference as Map-Reduce (Jacopo Urbani). Map = group (key,val) pairs. Reduce = process grouped pairs. Inference = join: Map = group equal 1st or 3rd elements, Reduce = perform joins. Example: Socrates a Human + Human subClassOf Animal ⇒ Socrates an Animal. [Figure: Map and Reduce stages grouping the example triples A p C, A q B, D r D, E r D, F r C by shared term]
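
The "inference = join" idea can be sketched in a few lines of plain Python. This is a toy, in-memory imitation of the map, shuffle and reduce steps for the Socrates example only. The function names (`map_phase`, `shuffle`, `reduce_phase`) and the string predicates `type` and `subClassOf` are assumptions made for illustration, and nothing here uses the Hadoop API.

```python
from collections import defaultdict


def map_phase(triples):
    """Emit (key, triple) pairs, keyed on the term the join happens on."""
    for s, p, o in triples:
        if p == "type":
            yield o, (s, p, o)   # key on the 3rd element (the class)
        elif p == "subClassOf":
            yield s, (s, p, o)   # key on the 1st element (the subclass)


def shuffle(pairs):
    """Group values by key, as the framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Join instances of class `key` with the superclasses of `key`."""
    instances = [s for s, p, _ in values if p == "type"]
    supers = [o for _, p, o in values if p == "subClassOf"]
    for inst in instances:
        for sup in supers:
            yield (inst, "type", sup)


triples = [("Socrates", "type", "Human"), ("Human", "subClassOf", "Animal")]
groups = shuffle(map_phase(triples))
derived = {t for key, vals in groups.items() for t in reduce_phase(key, vals)}
```

Both input triples land in the group keyed `Human`, so the reducer for that key can perform the join without seeing any other data.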

WebPIE: Scalable reasoning in Hadoop (Jacopo Urbani). Deploy Map-Reduce on the Hadoop platform. Run on a cluster: 64 quad-core nodes, cheap network (Gbit ethernet), cheap HD (1/node), limited memory (4GB/node).

WebPIE performance: headlines. Compute the closure of 1.6B triples in < 2 hr (Uniprot, OWL, 64 nodes). Compute the closure of 100B triples in < 2 days (LUBM, OWL, 64 nodes). Linear scalability.
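
As a back-of-the-envelope reading of these headlines, the sketch below turns them into throughput figures. The slides give the run times only as upper bounds, so the results are lower bounds on input triples processed per second across the whole cluster. The helper name `min_throughput` is an assumption for illustration.

```python
def min_throughput(triples, hours):
    """Lower bound on input triples per second for a run finishing within `hours`."""
    return triples / (hours * 3600)


uniprot = min_throughput(1.6e9, 2)   # 1.6B triples in < 2 hr
lubm = min_throughput(100e9, 48)     # 100B triples in < 2 days
```

This gives at least roughly 2.2 × 10^5 input triples per second for Uniprot and 5.8 × 10^5 for LUBM. The two figures come from different datasets, so they are not directly comparable.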

WebPIE: performance. Scalability (on the input size, using LUBM on 100 billion triples). In this experiment we evaluated how WebPIE performs as the input grows. We always used the same number of machines (64) and generated LUBM data up to 100 billion triples. The graph shows that the execution time increases linearly with the input size. This is good.

WebPIE: performance. Scalability (on the number of nodes, up to 32 nodes). Here we tested how performance changes as we increase the number of nodes. We kept the size of the input constant (1B triples) and repeatedly doubled the number of nodes. At the beginning the speedup appears superlinear, but this is misleading: the Hadoop settings were not optimized for execution on one machine, so the single-machine execution time is unduly penalized. The real scaling is linear or even sublinear, as shown in the last part of the graph (look at the difference between 16 and 32 nodes).
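
The argument about misleading superlinearity can be stated with the usual definitions of speedup and parallel efficiency. The symbols T(n), S(n), E(n) and the penalty factor c are notation introduced here, not taken from the slides.

```latex
% T(n): execution time on n nodes for a fixed input
S(n) = \frac{T(1)}{T(n)}, \qquad E(n) = \frac{S(n)}{n}

% Superlinear speedup means E(n) > 1.
% If the single-node run is penalized by a factor c > 1, i.e. the measured
% baseline is c \, T(1), then the measured speedup is inflated by the same factor:
S_{\text{measured}}(n) = \frac{c \, T(1)}{T(n)} = c \, S(n)

% A comparison between multi-node runs avoids the penalized baseline:
\frac{T(16)}{T(32)} \le 2 \quad \text{(equality = linear scaling, strict inequality = sublinear)}
```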

Scalable reasoning in Hadoop

What to do for infinite scalability? (2/2) Brain the size of a planet (Eyal Oren). Anytime convergence (more complete over time).

What to do for infinite scalability? MarVIN (brain the size of a planet): Divide – Conquer – Swap. (1) Split the input across peers. (2) Calculate the closure. (3) If you want more completeness, go to 1. RDFS closure of 200M triples in 7 minutes. Approximate reasoning: full closure guarantee at 
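
The divide, conquer, swap loop can be imitated in a single process. The sketch below is a toy simulation under strong assumptions: one inference rule only (type propagation along `subClassOf`), peers modelled as list slices, and a purely random re-partition in every round. The names `infer` and `divide_conquer_swap` and the sample data are hypothetical, and this is not MarVIN's actual implementation.

```python
import random


def infer(facts):
    """One peer's local closure under: x type C, C subClassOf D => x type D."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for x, p, c in list(facts):
            if p != "type":
                continue
            for c2, q, d in list(facts):
                if q == "subClassOf" and c2 == c and (x, "type", d) not in facts:
                    facts.add((x, "type", d))
                    changed = True
    return facts


def divide_conquer_swap(triples, n_peers, rounds, seed=0):
    rng = random.Random(seed)
    derived = set(triples)
    for _ in range(rounds):
        pool = sorted(derived)        # sorted so a given seed is reproducible
        rng.shuffle(pool)             # swap: triples meet new neighbours
        parts = [pool[i::n_peers] for i in range(n_peers)]   # divide
        for part in parts:            # conquer: each peer reasons locally
            derived |= infer(part)
    return derived


data = [
    ("Socrates", "type", "Human"),
    ("Plato", "type", "Human"),
    ("Human", "subClassOf", "Mammal"),
    ("Mammal", "subClassOf", "Animal"),
]
partial = divide_conquer_swap(data, n_peers=3, rounds=5)
```

Every round is sound but possibly incomplete: `partial` always lies between the input and the full closure, and more rounds can only add triples.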

Questions. Does this guarantee completeness? Yes: theoretical model, experimentally verified. Will this take forever? Yes, if triples are exchanged randomly. No, if we can do something better. 28-April-09

Why is random routing inefficient? Random is inefficient: triples meet other triples randomly, and most meetings are useless because inferences are sparse. Random scales badly: useful meetings decrease as system size increases.

Why is efficient routing difficult? Efficient: term-based partitions. All triples with term x go to node y. For inferencing, you need terms in common. But this will not work: the term distribution is very skewed (Zipf), so load balance will be too uneven. Example: Socrates a Human + Human subClassOf Animal ⇒ Socrates an Animal.
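
A small sketch of why term-based partitioning runs into load imbalance. It assumes each triple is routed to the nodes that own its subject and its object, with `zlib.crc32` standing in for whatever hash a real system would use. The data set (1000 hypothetical instances of one popular class) is invented to mimic a skewed term distribution.

```python
import zlib
from collections import Counter


def node_for(term, n_nodes):
    """Deterministic hash of a term to a node id (crc32 is a stand-in)."""
    return zlib.crc32(term.encode("utf-8")) % n_nodes


def load_per_node(triples, n_nodes):
    """Count triples per node when routing on subject and object terms."""
    load = Counter()
    for s, _p, o in triples:
        # a set, so a triple is counted once if both terms map to one node
        for node in {node_for(s, n_nodes), node_for(o, n_nodes)}:
            load[node] += 1
    return load


# Hypothetical skewed data: every triple mentions the popular term "Human".
triples = [(f"person{i}", "type", "Human") for i in range(1000)]
triples.append(("Human", "subClassOf", "Animal"))

load = load_per_node(triples, n_nodes=8)
hot = node_for("Human", 8)
```

The node that owns `Human` receives all 1001 triples, however many nodes are added, while the other nodes share the remainder.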

Data clustering with SpeedDate. [Figure panels: DHT, Random, SpeedDate]

SpeedDate vs. other approaches: we’re almost as good as a DHT.

SpeedDate with various data distributions: we can handle skewed data.

SpeedDate under network churn: we can handle node failures.

SpeedDate scaling with system size: we scale ~ sqrt(x).

Experimental speedup

[Figure: Map-Reduce (Jacopo Urbani) next to Divide-Conquer-Swap (Spyros Kotoulas, Eyal Oren): input data → compute → output data]

Conclusion: Inference in weak logics at very, VERY large scale is possible. Future challenges: incremental reasoning, stream reasoning, approximate reasoning (targeted incompleteness), stronger logics, cost predictions.

Semantic Web Intro in 6 slides & a movie

http://www.youtube.com/watch?v=tBSdYi4EY3s

P1. Give all things a name

P2. Relations form a graph between things

P3. The names are addresses on the Web. [<x> IsOfType <T>]: x and T have different owners & locations. [Figure label: <village>]

P1+P2+P3 = Giant Global Graph

P4. Explicit & formal semantics: assign types to things, assign types to relations, organise types in a hierarchy, impose constraints on possible interpretations.

Examples of “semantics”: Frank married-to Lynda. Frank married-to Hazel. married-to relates males to females, so Frank is male. married-to relates 1 male to 1 female, so Lynda = Hazel. Lower bound, upper bound. Semantics = predictable inference.
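
The married-to example can be replayed mechanically. The sketch below assumes the constraints are encoded as a domain, a range and an "at most one value" (functional) declaration. The class names `Male` and `Female`, the `sameAs` predicate and the function name `infer_constraints` are labels chosen here for illustration, not vocabulary from the slides.

```python
def infer_constraints(facts, domain, range_, functional):
    """Derive type and sameAs facts from domain, range and 'at most one' constraints."""
    derived = set()
    for s, p, o in facts:
        if p in domain:
            derived.add((s, "type", domain[p]))   # subject must be in the domain
        if p in range_:
            derived.add((o, "type", range_[p]))   # object must be in the range
    for s1, p1, o1 in facts:
        for s2, p2, o2 in facts:
            # same subject, same functional property, two names: they must co-refer
            if p1 == p2 and p1 in functional and s1 == s2 and o1 < o2:
                derived.add((o1, "sameAs", o2))
    return derived


facts = {("Frank", "married-to", "Lynda"), ("Frank", "married-to", "Hazel")}
derived = infer_constraints(
    facts,
    domain={"married-to": "Male"},
    range_={"married-to": "Female"},
    functional={"married-to"},
)
```

Anyone applying the same rules to the same facts reaches the same conclusions, which is the sense in which semantics makes inference predictable.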