Very VERY large scale knowledge representation
Can we do this at “Web scale”? Reminder: RDF(S) & OWL are (weak) logics. We can infer implicit statements from explicit ones: subClass hierarchy, predicate types, cardinalities, transitive & symmetric relations, inconsistency, ...
25 billion facts and counting
1 triple:
10^7 Triples: Suez Canal (163 km, 10^5 m) Denny Vrandečić – AIFB, Universität Karlsruhe (TH)
10^8 Triples: Moon. RDF Store: subsecond querying
~10^9 Triples: Earth (Diameter: 10^5 km, Volume: 10^21 km³)
~10^10 Triples ≈ 1 triple per web-page: Jupiter [LarKC proposal]
~10^11 Triples
Rephrase inference as Map-Reduce. Map = group (key,val) pairs; Reduce = process grouped pairs. Inference = join: Map = group triples with equal 1st or 3rd elements; Reduce = perform joins. Example: Socrates a Human, Human subClassOf Animal ⇒ Socrates an Animal. [Figure: Map-Reduce counting term occurrences in the triples (A p C), (A q B), (D r D), (E r D), (F r C), giving C 2, p 1, r 3, q 1, D 3, F 1; triples grouped by term, e.g. <C,_,_>, <A,_,_>, <F,_,_>] Jacopo Urbani
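The "inference = join" idea can be sketched in a few lines of Python. This is a minimal in-memory illustration of one rule only (the subClassOf rule from the Socrates example), not WebPIE's code; the function names `map_triple`, `reduce_group` and `run_job` are hypothetical, and a real job would run on Hadoop rather than on a dictionary.

```python
from collections import defaultdict

TYPE, SUBCLASS = "a", "subClassOf"

def map_triple(triple):
    """Map: emit (join_key, tagged_value) so triples sharing a class meet."""
    s, p, o = triple
    if p == TYPE:
        yield o, ("instance", s)      # key on the 3rd element (the class)
    elif p == SUBCLASS:
        yield s, ("superclass", o)    # key on the 1st element (the subclass)

def reduce_group(key, values):
    """Reduce: join the instances of a class with its superclasses."""
    instances = [v for tag, v in values if tag == "instance"]
    supers = [v for tag, v in values if tag == "superclass"]
    for inst in instances:
        for sup in supers:
            yield (inst, TYPE, sup)

def run_job(triples):
    """Simulate one Map-Reduce pass: map, group by key, reduce."""
    groups = defaultdict(list)
    for t in triples:
        for key, value in map_triple(t):
            groups[key].append(value)
    derived = set()
    for key, values in groups.items():
        derived.update(reduce_group(key, values))
    return derived

triples = [("Socrates", "a", "Human"), ("Human", "subClassOf", "Animal")]
derived = run_job(triples)
print(derived)
```

Both input triples are keyed on Human, so they land in the same group and the reducer can join them. A full closure would repeat such passes until no new triples appear.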
WebPIE: Scalable reasoning in Hadoop. Deploy Map-Reduce on the Hadoop platform. Run on a cluster: 64 quad-core nodes, cheap network (Gbit Ethernet), cheap HD (1/node), limited memory (4GB/node). Jacopo Urbani
WebPIE performance: headlines. Compute the closure of 1.6B triples in < 2hr (Uniprot, OWL, 64 nodes). Compute the closure of 100B triples in < 2 days (LUBM, OWL, 64 nodes). Linear scalability.
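As a back-of-the-envelope reading of these headlines, the snippet below converts them into input triples per second. Because the reported times are upper bounds ("<"), the rates are lower bounds; they count input triples only, not derived ones. The helper name `min_throughput` is hypothetical.

```python
HOUR = 3600
DAY = 24 * HOUR

def min_throughput(input_triples, max_seconds):
    """Lower bound on input triples processed per second."""
    return input_triples / max_seconds

uniprot_rate = min_throughput(1.6e9, 2 * HOUR)   # 1.6B triples in < 2 hr
lubm_rate = min_throughput(100e9, 2 * DAY)       # 100B triples in < 2 days

print(f"Uniprot: > {uniprot_rate:,.0f} input triples/s")
print(f"LUBM:    > {lubm_rate:,.0f} input triples/s")
print(f"LUBM per node (64 nodes): > {lubm_rate / 64:,.0f} input triples/s")
```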
WebPIE: performance. Scalability on the input size (LUBM, up to 100 billion triples). In this experiment we evaluated how WebPIE performs as the input grows. We kept the number of machines fixed (64) and generated LUBM datasets of up to 100 billion triples. The graph shows that execution time increases linearly with input size, which is the desired behaviour.
WebPIE: performance. Scalability on the number of nodes (up to 32 nodes). Here we tested how performance changes as the number of nodes increases. We kept the input size constant (1B triples) and repeatedly doubled the number of nodes. At first the speedup appears superlinear, but this is misleading: the Hadoop settings were not optimized for execution on a single machine, so the single-machine run time is unfairly penalized. The real speedup is linear or even sublinear, as shown in the last part of the graph (compare 16 and 32 nodes).
Scalable reasoning in Hadoop
What to do for infinite scalability? (2/2) “Brain the size of a planet”; anytime convergence (more complete over time). Eyal Oren
What to do for infinite scalability? MarVIN: Divide – Conquer – Swap. (1) Split the input across peers; (2) calculate the closure; (3) if you want more completeness, go to 1. RDFS closure of 200M triples in 7 minutes. Approximate reasoning: full closure guarantee at brain the size of a planet
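The Divide – Conquer – Swap loop can be illustrated with a toy, single-process simulation. This is a sketch under strong assumptions, not MarVIN's implementation: peers are plain Python sets, the only rule is the subClassOf rule, and the swap step is modelled by pooling all results and re-scattering them at random. The names `local_closure` and `divide_conquer_swap` are hypothetical.

```python
import random

def local_closure(triples):
    """Apply (x a C), (C subClassOf D) => (x a D) until nothing new appears."""
    closed = set(triples)
    changed = True
    while changed:
        changed = False
        sub = {(s, o) for s, p, o in closed if p == "subClassOf"}
        new = {(s, "a", d) for s, p, c in closed if p == "a"
               for c2, d in sub if c2 == c}
        if not new <= closed:
            closed |= new
            changed = True
    return closed

def divide_conquer_swap(triples, n_peers=4, rounds=20, seed=0):
    """Toy loop: scatter triples, close locally, pool results, repeat."""
    rng = random.Random(seed)
    pool = set(triples)
    for _ in range(rounds):
        # Divide: scatter the current triples randomly over the peers.
        peers = [set() for _ in range(n_peers)]
        for t in sorted(pool):            # sorted for reproducible runs
            peers[rng.randrange(n_peers)].add(t)
        # Conquer: each peer closes only what it holds locally.
        # Swap: results are pooled and re-scattered in the next round.
        pool = set().union(*(local_closure(p) for p in peers))
    return pool

data = [("Socrates", "a", "Human"),
        ("Human", "subClassOf", "Animal"),
        ("Animal", "subClassOf", "LivingThing")]
approx = divide_conquer_swap(data, n_peers=4, rounds=10)
full = local_closure(data)
print(len(approx), "of", len(full), "triples after 10 rounds")
```

Every round can only add sound triples, so the result is always a subset of the full closure and grows with the number of rounds. How quickly it converges depends on which triples happen to meet on the same peer.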
Questions: Does this guarantee completeness? Yes: theoretical model, experimentally verified. Will this take forever? Yes, if triples are exchanged randomly; no, if we can do something better. 28-April-09
Why is random routing inefficient? Random is inefficient: triples meet other triples randomly, and most meetings are useless because inferences are sparse. Random scales badly: useful meetings decrease as system size increases.
Why is efficient routing difficult? Efficient: term-based partitions. All triples with term x go to node y, because for inferencing you need terms in common (Socrates a Human, Human subClassOf Animal ⇒ Socrates an Animal). But this will not work: the term distribution is very skewed (Zipf), so the load balance will be too uneven.
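The load-balance problem can be made concrete with a small deterministic sketch. It assumes a Zipf-like distribution in which the term of rank k occurs about `top_count / k` times, and it uses `rank % n_nodes` as a stand-in for hashing a term to a node; both choices, and the function names, are illustrative assumptions rather than anything taken from the slides.

```python
def zipf_counts(n_terms, top_count):
    """Term of rank k occurs about top_count / k times (Zipf-like)."""
    return {f"term{k}": top_count // k for k in range(1, n_terms + 1)}

def node_loads(counts, n_nodes):
    """Send every triple containing a term to that term's node."""
    loads = [0] * n_nodes
    for rank, (term, count) in enumerate(counts.items(), start=1):
        loads[rank % n_nodes] += count   # stand-in for hash(term) % n_nodes
    return loads

counts = zipf_counts(n_terms=1000, top_count=1000)
loads = node_loads(counts, n_nodes=16)
imbalance = max(loads) / (sum(loads) / len(loads))
print(f"busiest node carries {imbalance:.1f}x the average load")
```

The node that receives the most frequent term ends up with more than twice the average load in this setting, regardless of how the remaining terms are spread.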
Data clustering with SpeedDate [Figure panels: DHT, Random, SpeedDate]
SpeedDate vs. other approaches: we’re almost as good as a DHT
SpeedDate with various data distributions: we can handle skewed data
SpeedDate under network churn: we can handle node failures
SpeedDate scaling with system size: we scale ~ sqrt(x)
Divide-Conquer-Swap
Conclusion: Inference in weak logics at very, VERY large scale is possible. Future challenges: incremental reasoning, stream reasoning, approximate reasoning (targeted incompleteness), stronger logics, cost predictions.
Semantic Web Intro in 6 slides & a movie
P1. Give all things a name
P2. Relations form a graph between things
P3. The names are addresses on the Web. [<x> IsOfType <T>]: x and T (e.g. <village>) can have different owners & locations
P1+P2+P3 = Giant Global Graph
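A minimal sketch of why shared Web names yield one graph: two independently published triple sets merge by plain set union, and a path can then cross from one source into the other. All URIs below sit under the placeholder domain example.org, and the `subClassOf` triple and the `settlement` class are hypothetical additions for illustration.

```python
# Hypothetical URIs under example.org; any Web addresses would do.
EX_A = "http://example.org/site-a/"
EX_B = "http://example.org/site-b/"

# Two graphs with different owners & locations that share one name.
site_a = {(EX_A + "x", EX_A + "IsOfType", EX_B + "village")}
site_b = {(EX_B + "village", EX_B + "subClassOf", EX_B + "settlement")}

def merge(*graphs):
    """Union of triple sets: shared names glue the graphs together."""
    return set().union(*graphs)

def reachable(graph, start):
    """All nodes reachable from start by following edges forward."""
    seen, frontier = set(), [start]
    while frontier:
        node = frontier.pop()
        for s, _, o in graph:
            if s == node and o not in seen:
                seen.add(o)
                frontier.append(o)
    return seen

ggg = merge(site_a, site_b)
print(reachable(ggg, EX_A + "x"))
```

Within site A alone, x only reaches the village type; in the merged graph it also reaches whatever site B says about that type.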
P4. Explicit & formal semantics: assign types to things; assign types to relations; organise types in a hierarchy; impose constraints on possible interpretations
Examples of “semantics”. Facts: Frank married-to Lynda; Frank married-to Hazel. married-to relates males to females ⇒ Frank is male. married-to relates 1 male to 1 female ⇒ Lynda = Hazel. (lowerbound, upperbound) Semantics = predictable inference
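The two inferences on this slide can be reproduced with a short sketch. It hard-codes the slide's constraints as Python tables (`DOMAIN`, `RANGE`, `FUNCTIONAL` are hypothetical names standing in for the corresponding schema statements) and is an illustration of the idea, not an RDFS or OWL reasoner.

```python
from collections import defaultdict

facts = {("Frank", "married-to", "Lynda"),
         ("Frank", "married-to", "Hazel")}

DOMAIN = {"married-to": "male"}     # married-to relates males ...
RANGE = {"married-to": "female"}    # ... to females
FUNCTIONAL = {"married-to"}         # 1 male to 1 female

def infer(facts):
    """Derive type statements and equalities implied by the constraints."""
    types, same = set(), set()
    objects = defaultdict(set)
    for s, p, o in facts:
        if p in DOMAIN:
            types.add((s, DOMAIN[p]))
        if p in RANGE:
            types.add((o, RANGE[p]))
        if p in FUNCTIONAL:
            objects[(s, p)].add(o)
    # A functional relation with two different objects forces them equal.
    for objs in objects.values():
        first, *rest = sorted(objs)
        same.update((first, other) for other in rest)
    return types, same

types, same = infer(facts)
print(types)
print(same)
```

Any system that implements the same semantics must reach the same conclusions from these facts, which is what makes the inference predictable.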