DBpedia: A Nucleus for a Web of Open Data Original presentation by Christian Bizer, Freie Universität Berlin Sören Auer , Universität Leipzig Georgi Kobilarov, Freie Universität Berlin Jens Lehmann, Universität Leipzig Richard Cyganiak, Freie Universität Berlin Edited by Sangkeun Lee
DBpedia.org is a effort to : extract structured information from Wikipedia make this information available on the Web under an open license interlink the DBpedia dataset with other datasets on the Web
Outline: 1. Extracting Structured Information from Wikipedia 2. The DBpedia Dataset 3. Accessing the DBpedia Dataset over the Web 4. Use Cases: Improving Wikipedia Search Royalty-Free Data Source for other Applications Nucleus for the Emerging Web of Data
Title Abstract Infoboxes Geo-coordinates Categories Images Links Other languages Other wiki pages To the web Redirects Disambiguates
Extracting Structured Information from Wikipedia Wikipedia consists of 6.9 million articles in 251 languages monthly growth-rate: 4% Wikipedia articles contain structured information infoboxes which use a template mechanism images depicting the article’s topic categorization of the article links to external webpages intra-wiki links to other articles inter-language links to articles about the same topic in different languages
Overview of the DBpedia component
Traditional Web Browser Web 2.0 Mashups Semantic Web Browsers SPARQL Endpoint Linked Data SNORQL Browser Query Builder Virtuoso Articles MySQL Infobox Categories Wikipedia Dumps DB tables Article texts DBpedia datasets loaded into published via Extraction
Wikitext Syntax:
Extracting Infobox Data (RDF Representation): http://en.wikipedia.org/wiki/Calgary http://dbpedia.org/resource/Calgary dbpedia:native_name Calgary”; dbpedia:altitude “1048”; dbpedia:population_city “988193”; dbpedia:population_metro “1079310”; mayor_name dbpedia:Dave_Bronconnier ; governing_body dbpedia:Calgary_City_Council; ...
How good is the extraction from the markup in Wiki pages? Question: How good is the extraction from the markup in Wiki pages?
Short and long abstracts in 10 different languages dbpedia:Calgary dbpedia:abstract “Calgary is the largest ...”@en ; dbpedia:abstract “Calgary ist eine Stadt ...”@de . Categorization information skos:subject dbpedia:Category_Cities_in_Alberta ; skos:subject dbpedia:Host_cities_Olympic_Games . Links to the original Wikipedia articles, pictures and relevant external web pages foaf:page <http://en.wikipedia.org/wiki/Calgary> ; dbpedia:wikipage-de<http://de.wikipedia.org/wiki/Calgary> ; foaf:depiction <http://upload.wikimedia.org/thumb/3/32> ; dbpedia:reference <http://www.calgary.ca> ; dbpedia:reference <http://www.tourismcalgary.com>.
DBpedia Basics : The structured information can be extracted from Wikipedia and can serve as a basis for enabling sophisticated queries against Wikipedia content. The DBpedia.org project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web. It uses the SPARQL query language to query this data. At Developers Guide to Semantic Web Toolkits you find a development toolkit in your preferred programming language to process DBpedia data.
The DBpedia Dataset 1,600,000 concepts including 58,000 persons 70,000 places 35,000 music albums 12,000 films described by 91 million triples using 8,141 different properties. 557,000 links to pictures 1,300,000 links external web pages 207,000 Wikipedia categories 75,000 YAGO categories
Accessing the DBpedia Dataset over the Web 1. SPARQL Endpoint 2. Linked Data Interface 3. DB Dumps for Download
SPARQL : SPARQL is a query language for RDF. RDF is a directed, labeled graph data format for representing information in the Web. This specification defines the syntax and semantics of the SPARQL query language for RDF. SPARQL can be used to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware.
The DBpedia SPARQL Endpoint http://dbpedia.org/sparql hosted on a OpenLink Virtuoso server can answer SPARQL queries like Give me all Sitcoms that are set in NYC? All tennis players from Moscow? All films by Quentin Tarentino? All German musicians that were born in Berlin in the 19th century?
Interesting Example: entities Table To know everything Bart wrote on blackboard board in season 12 of Simpson's: The Simpson episode Wikipedia pages are the identified "things" that we would consider as the subjects of our RDF triples. The bottom of the Wikipedia page for the "Tennis the Menace" episode tells us that it is a member of the Wikipedia category "The Simpsons episodes, season 12". The episode's DBpedia page tells us that p:blackboard is the property name for the Wikipedia infobox "Chalkboard" field. entities SELECT ?episode,?chalkboard_gag WHERE { ?episode skos:subject <http://dbpedia.org/resource/Category:The_Simpsons_episodes%2C_season_12>. ?episode dbpedia2:blackboard ?chalkboard_gag } Table
The Linked Data Interface: A large body of information and knowledge is often already available in structured form, yet not accessible as such on the Web. Integrating open data provides real value. It saves the time and effort to re-enter data that is already out there and it leaves the data and editing where it belongs: at its origin. Linked Data on the Web can be accessed using Semantic Web browsers, just as the traditional Web of documents is accessed using HTML browsers. Semantic Web browsers enable users to navigate between different data sources by following RDF links. It also allows the robots of Semantic Web search engines to follow these links to crawl the Semantic Web.
The Linked Data Interface The project follows the Linked Data principles All concepts are identified using Uniform Resource Identifier references. URI is a compact string of characters used to identify or name a resource. The Linked Data interface can be used by Semantic Web Browsers, like - DISCO Hyperdata Browser - Tabulator Browser - OpenLink RDF Browser Semantic Web Crawlers, like - Zitgist (Zitgist LLC, USA) - SWSE (DERI, Ireland) - Swoogle (UMBC, USA )
DBpedia Use Cases 1. Improving Wikipedia Search 2. Royalty-Free Data Source for other Applications 3. Nucleus for the Emerging Web of Data
Improving Wikipedia Search (Various Interfaces)
Query to find all web browser S/W at http://wikipedia.askw.org :
Improving Wikipedia Search
Royalty-Free Data Source for other Applications DBpedia is published under GNU Free Documentation License Example use case: SPARQL generated tables within webpages Royalty-Free Data Source for other ApplicationsRoyalty-Free Data Source for other Applications
Nucleus for the Emerging Web of Data W3C SWEO Linking Open Data Project Over all size of the dataset: over 1 billion RDF triples Out-bound RDF links within DBpedia: 75,000
Proposed Improvements: Better data cleansing required. Improvement in the classification. Interlink DBpedia with more datasets. Improvement in the user interfaces. Performance Scalability More Expressiveness
Discussion DBpedia is the first and largest source of structured data on the Internet covering topics of general knowledge. DBpedia gains new information when it extracts data from the latest Wikipedia dump, whereas Freebase, in addition to Wikipedia extractions, gains new information through its userbase of editors. Which one is better approach? Can Freebase or DBpedia be substitute for Wikipedia? Freebase : Not good in that we have two similar things – Wikipedia, Freebase DBPedia : Not good in that it extracts data from dump How can we interlink Freebase & DBpedia? What can be killer applications using Dbpedia? If there is, okay If there is no, do we really need a large general structured knowledge?