1 A Sitemap extension to enable efficient interaction with large quantities of Linked Data
Giovanni Tummarello, Ph.D
DERI Galway
Copyright 2005 Digital Enterprise Research Institute. All rights reserved.
2 Linked Data on the Semantic Web
The “Semantic Web”, as we are starting to mean it today: the set of all RDF models that can be resolved by a URL (source).
Size of the current Semantic Web: m documents
Most of it is produced by mapping relational databases using the “linked data” approach:
–The identifier (URI) is actually a URL. We call these URI/URLs
–...minted in the namespace of the data producer...
–so that the data producer's Web server can generate a description of the entity when it is “resolved”, e.g. via HTTP
–Example
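Resolving a URI/URL over HTTP can be sketched as follows. This is a minimal illustration, not part of the original slides: the URI `http://example.org/products/42` is hypothetical, and asking for `application/rdf+xml` via the `Accept` header is one common convention, assumed here rather than required by anything above.

```python
import urllib.request


def build_lookup(uri: str) -> urllib.request.Request:
    """Prepare an HTTP GET that asks the server for an RDF description."""
    return urllib.request.Request(uri, headers={"Accept": "application/rdf+xml"})


def resolve(uri: str, timeout: float = 10.0) -> bytes:
    """Dereference the URI/URL and return the raw document bytes."""
    with urllib.request.urlopen(build_lookup(uri), timeout=timeout) as response:
        return response.read()


# Hypothetical entity URI; nothing is sent until resolve() is called.
request = build_lookup("http://example.org/products/42")
print(request.get_method(), request.full_url, request.get_header("Accept"))
```

Each such lookup is one hit on the producer's server, which is why crawling a large dataset this way becomes expensive.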
3 Cost of creating new documents on the SW
If you have the data, it is moderately low
From your existing DB, apply a mapping layer (e.g. D2R or Virtuoso)
Produce as many RDF files, retrievable from your URL prefix, as you have entities
Success?
–More is needed to make your data useful (e.g. linking to OTHER URIs if your entities are not something completely “yours”)
–You need to let the world know your data is there.
4 Large quantities of linked data: how to expose?
The fact that the data is HTTP-retrievable in small bits makes it crawlable.
But data producers are very scared of this:
–Millions of hits for each refresh
–Each hit potentially triggers many complex queries to generate the RDF view of the entity
–DoS incidents on the SW have happened (e.g. see the Geonames blog) and they are not fun.
Clearly something better must be possible:
–Most data producers do in fact already provide full dumps of the base data
–Or SPARQL endpoints
5 The idea: Extending Sitemaps to expose data
Sitemaps:
–Originally by Google, immediately adopted by all (Yahoo, MSN, etc.)
–Expose the “deep web” by providing a list of pages “to be crawled”
–Written in XML, linked directly in robots.txt
Example: [sitemap XML entry; surviving values: monthly, 0.8]
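How a crawler might find the sitemap from robots.txt can be sketched as below. This is an illustrative sketch only: the robots.txt content and the URL `http://www.example.org/sitemap.xml` are made up, and the helper name `sitemap_urls` is hypothetical.

```python
def sitemap_urls(robots_txt: str) -> list[str]:
    """Return the URLs declared by `Sitemap:` lines in a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        # Split on the first colon only, because the URL itself contains colons.
        field, sep, value = line.strip().partition(":")
        if sep and field.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls


# Hypothetical robots.txt for illustration.
example_robots = """\
User-agent: *
Disallow: /private/
Sitemap: http://www.example.org/sitemap.xml
"""
print(sitemap_urls(example_robots))
```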
6 The Semantic Sitemap Extension
Example first: [sitemap XML; surviving values: Product Catalog for Example.org, monthly]
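Since the XML of the example did not survive, here is a hedged sketch of what a Semantic Sitemap and a reader for it might look like. The `sc:` namespace URI and the element names (`sc:dataset`, `sc:datasetLabel`, `sc:linkedDataPrefix`, `sc:sparqlEndpointLocation`, `sc:dataDumpLocation`) are given as recalled from the published extension and should be checked against the full specification; all `example.org` URLs are invented for illustration.

```python
import xml.etree.ElementTree as ET

# "sm" is the standard Sitemap protocol namespace. "sc" is the Semantic
# Sitemap extension namespace as recalled from the spec (an assumption).
NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "sc": "http://sw.deri.org/2007/07/sitemapextension/scschema.xsd",
}

# Hypothetical sitemap; only the label and "monthly" come from the slide.
SITEMAP = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:sc="http://sw.deri.org/2007/07/sitemapextension/scschema.xsd">
  <sc:dataset>
    <sc:datasetLabel>Product Catalog for Example.org</sc:datasetLabel>
    <sc:linkedDataPrefix>http://example.org/products/</sc:linkedDataPrefix>
    <sc:sparqlEndpointLocation>http://example.org/sparql</sc:sparqlEndpointLocation>
    <sc:dataDumpLocation>http://example.org/dumps/products.rdf.gz</sc:dataDumpLocation>
    <changefreq>monthly</changefreq>
  </sc:dataset>
</urlset>
"""


def parse_datasets(xml_text: str) -> list[dict]:
    """Extract one dictionary per dataset element found in the sitemap."""
    root = ET.fromstring(xml_text)
    datasets = []
    for ds in root.findall("sc:dataset", NS):
        datasets.append({
            "label": ds.findtext("sc:datasetLabel", default="", namespaces=NS),
            "prefix": ds.findtext("sc:linkedDataPrefix", default=None, namespaces=NS),
            "sparql": ds.findtext("sc:sparqlEndpointLocation", default=None, namespaces=NS),
            # A dataset may list several dump locations (split dumps).
            "dumps": [e.text for e in ds.findall("sc:dataDumpLocation", NS)],
            "changefreq": ds.findtext("sm:changefreq", default=None, namespaces=NS),
        })
    return datasets


for dataset in parse_datasets(SITEMAP):
    print(dataset["label"], dataset["changefreq"], dataset["dumps"])
```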
12 Other features
Location of the SPARQL endpoint of the dataset
A representative URI/URL
Split data dumps
13 How it is meant to be used
As a crawler:
–If you are given a URL for an RDF site, check for the sitemap
–If a dump is available, download that instead
As a client:
–If you have a dump and want an update, check the sitemap to locate it in case it has moved
–Or to locate a SPARQL endpoint
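The crawler-side decision above might be sketched like this. The dictionary keys (`dumps`, `sparql`, `prefix`) and the function name `plan_fetch` are hypothetical, and the preference order (dump, then SPARQL endpoint, then crawling) is an assumption; the slides only state that a dump, when available, should be downloaded instead of crawling.

```python
def plan_fetch(dataset: dict) -> tuple[str, list[str]]:
    """Pick a way to obtain a dataset described in a sitemap.

    Returns a (strategy, locations) pair.
    """
    if dataset.get("dumps"):
        # One download (or a few, for split dumps) replaces many page hits.
        return ("dump", list(dataset["dumps"]))
    if dataset.get("sparql"):
        return ("sparql", [dataset["sparql"]])
    # Fall back to resolving individual URIs under the linked data prefix.
    return ("crawl", [dataset["prefix"]] if dataset.get("prefix") else [])


# Hypothetical dataset descriptions for illustration.
with_dump = {
    "prefix": "http://example.org/products/",
    "sparql": "http://example.org/sparql",
    "dumps": ["http://example.org/dumps/products.rdf.gz"],
}
without_dump = {"prefix": "http://example.org/products/", "dumps": []}

print(plan_fetch(with_dump))
print(plan_fetch(without_dump))
```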
14 Dumps (1): Tripledumps vs Quaddumps
The Semantic Web is a quadruple space (triple + source)
A Semantic Web site dump should therefore be in a quad format
But almost always, the only thing that really matters is a single triplestore
How to “slice” such a dataset to obtain the individual linked data files?
–The individual site owners decide how to generate the single linked data files.
–Unfortunately there is no standard interpretation of SPARQL DESCRIBE
–Some reasonable choices exist, however, but they might fail for specific use cases
–Guesswork or standardization?
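The difference between the two kinds of dump can be illustrated with a small sketch. With quads, the source is explicit and slicing is mechanical. With plain triples, a slicing rule has to be chosen; grouping by subject, shown here, is only one possible choice (an assumption for illustration, not the rule from the specification) and it ignores, for example, triples where the entity appears as object. All data and function names are hypothetical.

```python
from collections import defaultdict


def slice_quads(quads):
    """Group (subject, predicate, object, source) quads into triples per source."""
    documents = defaultdict(list)
    for s, p, o, source in quads:
        documents[source].append((s, p, o))
    return dict(documents)


def slice_triples_by_subject(triples):
    """One possible slicing of a triple dump: triples sharing a subject form a file."""
    documents = defaultdict(list)
    for s, p, o in triples:
        documents[s].append((s, p, o))
    return dict(documents)


# Hypothetical data for illustration.
quads = [
    ("ex:p1", "ex:name", "Widget", "http://example.org/products/1"),
    ("ex:p1", "ex:madeBy", "ex:acme", "http://example.org/products/1"),
    ("ex:acme", "ex:name", "Acme", "http://example.org/makers/acme"),
]
triples = [(s, p, o) for s, p, o, _ in quads]

print(sorted(slice_quads(quads)))
print(sorted(slice_triples_by_subject(triples)))
```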
15 Dumps (2): Compression and others
In case of a tripledump, one should specify the format, such as: rdf/xml, ntriples, turtle, n3
In case of a quaddump:
–Trig, Trix, Nquad
–Filename archival: archives where the filenames are created by URL-encoding the source location
Compression: the dump can be compressed; in this case one of the following formats should be specified:
–Tar, zip, gzip, bzip2, targzip, tarbz2
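Filename archival can be sketched as follows: each member of the archive is named by URL-encoding the source location, so the quad structure is recoverable from the filenames. This is a minimal sketch under assumptions: the function names are hypothetical, gzip-compressed tar is just one of the listed formats, and the exact encoding rules of the specification are not reproduced here.

```python
import io
import tarfile
from urllib.parse import quote, unquote


def archive_filename(source_url: str) -> str:
    """URL-encode a source location so it can serve as a flat filename."""
    return quote(source_url, safe="")


def build_archive(documents: dict[str, bytes]) -> bytes:
    """Pack {source URL: RDF bytes} into an in-memory gzip-compressed tar."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        for source, payload in documents.items():
            info = tarfile.TarInfo(name=archive_filename(source))
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def read_archive(data: bytes) -> dict[str, bytes]:
    """Recover {source URL: RDF bytes} by decoding the member filenames."""
    documents = {}
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:gz") as tar:
        for member in tar.getmembers():
            documents[unquote(member.name)] = tar.extractfile(member).read()
    return documents


# Hypothetical document for illustration.
docs = {"http://example.org/products/42": b"<rdf:RDF/>"}
print(archive_filename("http://example.org/products/42"))
print(read_archive(build_archive(docs)) == docs)
```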
16 Who uses it?
Data producers
–Geonames
–DBpedia
–Uniprot
–DBLP
–… (it takes 10 minutes to do one)
Data consumers
–Sindice
–Next: SWSE, DBin 2.0
17 Implementation in action: Sindice
Can help a user or a client (e.g. Tabulator) find useful Semantic Web sources to import.
Quick to update, monitors changes, crawls (soon)
First beta target: to index the currently known Semantic Web
Discovers and uses Semantic Sitemaps
18 Sindice scenario
[Diagram; surviving labels: DBLP, GeoNames, DBpedia; Disco, Piggy Bank, SIOC Explorer, etc.; the Tabulator]
19 Semantic Sitemaps: credits
Also thanks to:
Chris Bizer (Free University Berlin)
Richard Cyganiak (Free University Berlin)
Renaud Delbru (DERI Galway)
Andreas Harth (DERI Galway)
Aidan Hogan (DERI Galway)
Leandro Lopez
Stefano Mazzocchi (SIMILE - MIT)
Christian Morbidoni (SEMEDIA - Università Politecnica delle Marche)
Michele Nucci (SEMEDIA - Università Politecnica delle Marche)
Eyal Oren (DERI Galway)
Leo Sauermann (DFKI)
20 Conclusions
Sitemaps were born in the document web, explicitly to expose databases and the “inner web”
The idea: a Semantic Sitemap extension to cover efficient handling of RDF datasets by clients and search engines
Details still need some polishing, but it already works
Full specs at