Linked (Open) Data Speaker: Eric Jui-Lin Lu (呂瑞麟), Professor, Department of Management Information Systems, National Chung Hsing University
URL:
Linked Data Linked Data is about using the Web to connect related data that wasn't previously linked, or using the Web to lower the barriers to linking data currently linked using other methods Wikipedia defines Linked Data as "a term used to describe a recommended best practice for exposing, sharing, and connecting pieces of data, information, and knowledge on the Semantic Web using URIs and RDF."
FAQ: Linked Data vs. the Semantic Web!!
Tim Berners-Lee, inventor of the Web and the person credited with coining the terms Semantic Web and Linked Data, has frequently described Linked Data as "the Semantic Web done right". But others may not agree.
Linked Data Principles
The Semantic Web isn't just about putting data on the web. It is about making links, so that a person or machine can explore the web of data (see the sketch after this list).
1. Use URIs as names for things.
2. Use HTTP URIs so that people can look up those names.
3. When someone looks up a URI, provide useful information, using the standards (RDF*, SPARQL).
4. Include links to other URIs (other datasets), so that they can discover more things.
Source:
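As a minimal sketch of rules 2 and 3 (assuming the Python rdflib library; the Berlin URI is only an illustration), dereferencing an HTTP URI with content negotiation returns RDF that a program can parse:

from rdflib import Graph

g = Graph()
# rdflib sends RDF Accept headers, follows the redirect, and parses the result.
g.parse("http://dbpedia.org/resource/Berlin")
print(len(g), "triples retrieved about Berlin")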
RDF Links Two principal types of RDF triples can be distinguished
Literal Triples: have an RDF literal such as a string, number, or date as the object. Ex. (<…>, dc:creator, "Eric Jui-Lin Lu")
RDF Links: represent typed links between two resources. RDF links consist of three URI references. The URIs in the subject and the object position of the link identify the interlinked resources. The URI in the predicate position defines the type of the link. Ex. (<…>, dc:creator, <…>)
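A minimal rdflib sketch of the two triple types; the example.org URIs are hypothetical stand-ins for the URIs elided above:

from rdflib import Graph, Literal, Namespace, URIRef

DC = Namespace("http://purl.org/dc/elements/1.1/")
g = Graph()
doc = URIRef("http://example.org/papers/1")  # hypothetical document URI
# Literal triple: the object is an RDF literal (a string here).
g.add((doc, DC.creator, Literal("Eric Jui-Lin Lu")))
# RDF link: all three positions are URI references; the predicate types the link.
g.add((doc, DC.creator, URIRef("http://example.org/people/eric-lu")))
print(g.serialize(format="turtle"))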
RDF Link Generation Create manually Auto-generation
Auto-generation is a well-known problem in the DB community called Record Linkage. A good research topic to be discussed?! A good starting point:
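A toy record-linkage sketch, not a real algorithm: match records from two hypothetical datasets by name similarity and propose owl:sameAs candidates above a threshold:

from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    # Normalized edit-based similarity between two names.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

dataset_a = {"http://example.org/a/NCHU": "National Chung Hsing University"}
dataset_b = {"http://example.org/b/nchu": "National Chung-Hsing Univ."}

for uri_a, name_a in dataset_a.items():
    for uri_b, name_b in dataset_b.items():
        if similarity(name_a, name_b) > 0.8:  # candidate link, needs human review
            print(f"<{uri_a}> owl:sameAs <{uri_b}> .")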
Is your data 5-star? (Tim Berners-Lee's 5-star scheme: on the Web under an open license; as machine-readable structured data; in a non-proprietary format; using URIs to denote things; linked to other data.)
Linking Open Data Project
Goal: "expose" open datasets in RDF. Set RDF links among the data items from different datasets. Set up query endpoints. Altogether billions of triples, millions of links… The important point here is that (1) the data becomes available to the world via a unified format (i.e., RDF), regardless of how it is stored internally, and (2) the various datasets are interlinked, i.e., they are not independent islands. DBpedia is probably the most important 'hub' in the project. Source:
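A minimal sketch of querying such an endpoint (assuming the SPARQLWrapper package and DBpedia's public endpoint, where the Virtuoso server predefines the rdfs prefix):

from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    SELECT ?label WHERE {
      <http://dbpedia.org/resource/Berlin> rdfs:label ?label .
      FILTER (lang(?label) = "en")
    }
""")
sparql.setReturnFormat(JSON)
for binding in sparql.query().convert()["results"]["bindings"]:
    print(binding["label"]["value"])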
The LOD “cloud”, March 2008 Source:
The LOD “cloud”, 09/2010 Source:
The LOD “cloud”, 08/2014
LOD Categories When we look more closely at widely deployed vocabularies in the LOD cloud, we can group the semantic link types into:
person-related link types, such as foaf:knows from the Friend of a Friend (FOAF) vocabulary;
spatial link types, such as geo:lat from the Basic Geo vocabulary (WGS84 lat/long);
temporal link types, such as Dublin Core's dc:created property or the Event Ontology's event:time property;
structural link types, such as dc:isPartOf for representing structural semantics;
other link types, such as scovo:dimension from the Statistical Core Vocabulary (SCOVO).
Source: IEEE Internet Computing 2009,
Example LODs Wikipedia based:
DBpedia
Wikidata: absorbed the former Freebase (Freebase data was migrated into Wikidata)
The Data-Gov Wiki: linked data for US open data
YAGO: a huge semantic knowledge base, derived from Wikipedia, WordNet, and GeoNames. Currently, YAGO has knowledge of more than 10 million entities (like persons, organizations, cities, etc.) and contains more than 120 million facts about these entities. Source:
DBpedia DBpedia is a community effort to
extract structured ("infobox") information from Wikipedia
provide a query endpoint to the dataset
interlink the DBpedia dataset with other datasets on the Web
Source:
Articles: 285 language editions; the English edition contains about 2.4 million articles. Infobox: contains a short summary of the article's key facts.
Disambiguation pages: list the candidate pages other than the default search result.
Redirects: redirect pages point to the most likely result. For example, searching for "USA" automatically leads to the United States article (upper left).
Hyperlinks: contain links to other articles.
DBpedia framework
Extraction Manager
Extractors: turn specific types of wiki markup into triples.
Parsers: support extraction of determined data types, conversion of values between different units, and splitting of markup into lists.
Dump-based extraction
Monthly dump files. For English,
Live extraction The Wikimedia Foundation gave the DBpedia project access to an OAI-PMH live feed that immediately reports all Wikipedia changes. This stage uses the update stream to extract new RDF, running whenever a Wikipedia article changes.
11 extractors Labels. Abstracts. Interlanguage links.
All Wikipedia articles have a title, which is used as an rdfs:label for the corresponding DBpedia resource. Abstracts. We extract a short abstract (first paragraph, represented using rdfs:comment) and a long abstract (text before a table of contents, at most 500 words, using the property dbpedia:abstract) from each article. Interlanguage links. We extract links that connect articles about the same topic in different language editions of Wikipedia and use them for assigning labels and abstracts in different languages to DBpedia resources.
11 extractors Images. Redirects. Disambiguation.
Links pointing at Wikimedia Commons images depicting a resource are extracted and represented using the foaf:depiction property. Redirects. In order to identify synonymous terms, Wikipedia articles can redirect to other articles. We extract these redirects and use them to resolve references between DBpedia resources. (dbo:wikiPageRedirects) Disambiguation. Wikipedia disambiguation pages explain the different meanings of homonyms. We extract and represent disambiguation links using the predicate dbo:wikiPageDisambiguates.
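A sketch of using the extracted redirects to find synonymous terms (assuming the SPARQLWrapper package; as noted above, current DBpedia releases expose redirects as dbo:wikiPageRedirects):

from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
# Redirect pages that point at the United_States resource act as synonyms.
sparql.setQuery("""
    SELECT ?synonym WHERE {
      ?synonym dbo:wikiPageRedirects <http://dbpedia.org/resource/United_States> .
    } LIMIT 10
""")
sparql.setReturnFormat(JSON)
for binding in sparql.query().convert()["results"]["bindings"]:
    print(binding["synonym"]["value"])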
11 extractors External links. Pagelinks. Homepages.
Articles contain references to external Web resources, which we represent using the DBpedia property dbpedia:reference. (dbo:wikiPageExternalLink?) Pagelinks. We extract all links between Wikipedia articles and represent them using the dbpedia:wikilink property. Homepages. This extractor obtains links to the homepages of entities such as companies and organisations by looking for the terms homepage or website within article links (represented using foaf:homepage).
11 extractors Categories. Geo-coordinates.
Wikipedia articles are arranged in categories, which we represent using the SKOS vocabulary. Categories become skos:concepts; category relations are represented using skos:broader. Geo-coordinates. The geo-extractor expresses coordinates using the Basic Geo (WGS84 lat/long) Vocabulary and the GeoRSS Simple encoding of the W3C Geospatial Vocabulary. The former expresses latitude and longitude components as separate facts, which allows for simple areal filtering in SPARQL queries.
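A minimal sketch of the "simple areal filtering" mentioned above (assuming SPARQLWrapper; the bounding box around Taichung is illustrative):

from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
# Separate lat/long facts allow a plain numeric bounding-box filter.
sparql.setQuery("""
    SELECT ?place ?lat ?long WHERE {
      ?place geo:lat ?lat ; geo:long ?long .
      FILTER (?lat > 24.0 && ?lat < 24.3 && ?long > 120.5 && ?long < 120.8)
    } LIMIT 10
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["place"]["value"], row["lat"]["value"], row["long"]["value"])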
Infobox extraction
Infobox extraction
Infobox extraction Wikipedia's infobox template system has evolved over time without central coordination. Different templates use different names for the same attribute (the red boxes on the previous two slides), and attribute values are expressed using a wide range of different formats and units of measurement (the blue boxes on the previous two slides). Two approaches were taken: Generic infobox extraction Mapping-based infobox extraction
Generic infobox extraction
The corresponding DBpedia URI of the Wikipedia article is used as subject. The predicate URI is created by concatenating the namespace fragment and the name of the infobox attribute. Objects are created from the attribute value. Property values are post-processed in order to generate suitable URI references or literal values. This includes recognizing MediaWiki links, detecting lists, and using units as datatypes
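A toy sketch of the generic extraction rule described above; the function, the sample infobox, and the post-processing are simplifications, not DBpedia's actual code:

from rdflib import Graph, Literal, Namespace, URIRef

DBPROP = Namespace("http://dbpedia.org/property/")  # generic property namespace
RESOURCE = "http://dbpedia.org/resource/"

def extract_infobox(article: str, infobox: dict) -> Graph:
    # One triple per infobox attribute: the article's DBpedia URI is the subject,
    # namespace + attribute name is the predicate, the value yields the object.
    g = Graph()
    subject = URIRef(RESOURCE + article)
    for attribute, value in infobox.items():
        predicate = DBPROP[attribute]
        # Toy post-processing: a [[wiki link]] becomes a URI, anything else a literal.
        if value.startswith("[[") and value.endswith("]]"):
            obj = URIRef(RESOURCE + value[2:-2].replace(" ", "_"))
        else:
            obj = Literal(value)
        g.add((subject, predicate, obj))
    return g

print(extract_infobox("Calgary", {"population": "1019942",
                                  "country": "[[Canada]]"}).serialize(format="turtle"))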
Generic infobox extraction
The advantage: complete coverage of all infoboxes and infobox attributes. The main disadvantages: synonymous attribute names are not resolved, which makes writing queries against generic infobox data rather cumbersome. As Wikipedia attributes do not have explicitly defined datatypes, a further problem is the relatively high error rate of the heuristics that are used to determine the datatypes of attribute values.
Mapping-based infobox extraction
We mapped Wikipedia templates to an ontology. This ontology was created by manually arranging the 350 most commonly used infobox templates within the English edition of Wikipedia into a subsumption hierarchy consisting of 170 classes and then mapping 2350 attributes from within these templates to 720 ontology properties.
DBpedia knowledge base
Identifying entities Classifying entities Describing entities
Identifying entities DBpedia uses English article names for creating identifiers. Information from other language versions of Wikipedia is mapped to these identifiers by bi-directionally evaluating the interlanguage links between Wikipedia articles. Resources are assigned a URI according to the pattern http://dbpedia.org/resource/Name, where Name is taken from the URL of the source Wikipedia article, which has the form http://en.wikipedia.org/wiki/Name.
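A minimal sketch of deriving such an identifier from an article title (the quoting rule here is simplified):

from urllib.parse import quote

def dbpedia_uri(article_title: str) -> str:
    # Spaces become underscores, as in Wikipedia article URLs;
    # remaining special characters are percent-encoded.
    return "http://dbpedia.org/resource/" + quote(article_title.replace(" ", "_"))

print(dbpedia_uri("National Chung Hsing University"))
# http://dbpedia.org/resource/National_Chung_Hsing_University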
Classifying entities DBpedia entities are classified within four classification schemata Wikipedia Categories YAGO UMBEL DBpedia Ontology
Example: National Chung Hsing University
DBpedia classification (1/2) Taking the DBpedia resource for National Chung Hsing University as an example, its classifications are usually listed under rdf:type:
DBpedia Ontology
Wikipedia Categories
YAGO
UMBEL
However, not every DBpedia resource includes all four classification schemata.
DBpedia classification (2/2) DBpedia Ontology (1/2): a six-level tree structure with 521 classes
More than 900 properties with domain and range definitions
DBpedia Ontology (2/2): 1,650 different properties
Describing entities Every DBpedia entity is described by a set of general properties and a set of infobox-specific properties. The general properties include a label, a short and a long English abstract, a link to the corresponding Wikipedia article, (if available) geo-coordinates, a link to an image depicting the entity, links to external Web pages, and links to related DBpedia entities.
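A sketch that lists some of an entity's properties from the public endpoint (assuming SPARQLWrapper; Berlin is illustrative, and the general properties such as rdfs:label, rdfs:comment, and foaf:depiction appear among the results):

from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    SELECT ?property ?value WHERE {
      <http://dbpedia.org/resource/Berlin> ?property ?value .
    } LIMIT 20
""")
sparql.setReturnFormat(JSON)
for binding in sparql.query().convert()["results"]["bindings"]:
    print(binding["property"]["value"], "->", binding["value"]["value"])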
Automatic links among open datasets
Source:
<…> owl:sameAs <…> ;
    owl:sameAs <…> ;
    ...
<…> owl:sameAs <…> ;
    wgs84_pos:lat "…" ;
    wgs84_pos:long "…" ;
    geo:inCountry <…> ;
    ...
This is the important point: this is why this is Linking Open Data. What the community does (and this is a community project) is make agreements on how the different datasets can be linked; the linkage is then done automatically by the 'bridges' that map the public datasets into RDF. E.g., Geonames is a public dataset by geonames.org, and DBpedia is, well... Wikipedia :-) Processors can switch automatically from one to the other…
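A hypothetical reconstruction of such a bridge with rdflib; the DBpedia and Geonames URIs and the coordinates are illustrative, not copied from the original slide:

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import OWL

WGS84 = Namespace("http://www.w3.org/2003/01/geo/wgs84_pos#")

g = Graph()
dbpedia = URIRef("http://dbpedia.org/resource/Berlin")   # illustrative
geonames = URIRef("http://sws.geonames.org/2950159/")    # illustrative
# The sameAs link lets processors hop between the two datasets.
g.add((dbpedia, OWL.sameAs, geonames))
g.add((geonames, WGS84.lat, Literal("52.52437")))
g.add((geonames, WGS84.long, Literal("13.41053")))
print(g.serialize(format="turtle"))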
DBpedia DBpedia 3.9 (2014; including en, zh, and links): 440,516,428 RDF triples, N-Triples format
2015 (including core, core-i18n/en & zh): 411,885,960 RDF triples
2016 (including core, core-i18n/en & zh): 396,764,959 RDF triples, Turtle format
DBpedia Latest development
For this DBpedia version, the most notable addition is the NIF annotation datasets for each language, recording the whole wiki text, its basic structure (sections, titles, paragraphs, etc.), and the included text links. We hope that researchers and developers working on NLP-related tasks will find this addition most rewarding. The DBpedia Open Text Extraction Challenge (for SEMANTiCS 2017) was introduced to instigate new fact extraction based on these datasets.
NIF NIF: the NLP Interchange Format
NIF is an RDF/OWL-based format that aims to achieve interoperability between Natural Language Processing (NLP) tools, language resources and annotations. To be added in the future
Applications of LOD Browsers Search Engines and Indexing Others
LOD Browsers Tabulator and Marbles (?)
Source:
LOD Search Engines and Indexing
Falcons and SWSE (?) Source:
Example: Social bookmarking
Source: IEEE Internet Computing 2009,
Example: DBpedia Mobile
Source: IEEE Internet Computing 2009,
Example: BBC Music Source: IEEE Internet Computing 2009,
No Killer Applications!!
Why? Many developments were stopped, although they are important. No killer applications!!
Research Challenges User Interfaces and Interaction Paradigms
For example, in browsers: how entities and links are explored. Application-specific research topics: queries can be answered against the Web of Data by relying on runtime link traversal. Ex. semantic-based P2P systems, Question Answering over Linked Data (QALD)
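A minimal follow-your-nose sketch of runtime link traversal (assuming rdflib; the seed URI is illustrative): dereference a resource, then dereference whatever owl:sameAs links it exposes:

from rdflib import Graph, URIRef
from rdflib.namespace import OWL

start = URIRef("http://dbpedia.org/resource/Berlin")  # illustrative seed
g = Graph()
g.parse(start)                                        # dereference the seed URI
for _, _, linked in g.triples((start, OWL.sameAs, None)):
    try:
        g.parse(linked)                               # traverse the link at runtime
    except Exception:
        pass                                          # some targets may not serve RDF
print(len(g), "triples after one hop of traversal")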
Research Challenges Links Maintenance, Schema Mapping and Data Fusion
For better data integration, mapping terms between different vocabularies is required. Current RDF Schema and OWL allow only limited, coarse-grained mappings; a more fine-grained solution is preferred. Data fusion: fusing data about the same entity from different sources by resolving data conflicts. Links maintenance.
Research Challenges Licensing and Charging, Trust, Quality, and Relevance
As (Miller et al., 2008) discuss, copyright law is not applicable to data, which from a legal perspective is also treated differently across jurisdictions. A payment mechanism for data services is needed. Trust, Quality, and Relevance: also a research issue on the traditional Web. Privacy: e.g., on an Apache web server you provide a robots.txt to restrict crawlers; privacy protection requires a combination of technical and legal means, together with a higher awareness by users of what data to provide in which context.
Research Challenges Knowledge Extraction
For example: extract knowledge from Wikipedia into DBpedia, YAGO, etc. Ontology generation
References
Christian Bizer, Tom Heath, and Tim Berners-Lee (in press). Linked Data - The Story So Far. International Journal on Semantic Web and Information Systems, Special Issue on Linked Data.
Christian Bizer, Jens Lehmann, Georgi Kobilarov, Sören Auer, Christian Becker, Richard Cyganiak, and Sebastian Hellmann. DBpedia - A Crystallization Point for the Web of Data. Web Semantics: Science, Services and Agents on the World Wide Web, Vol. 7, No. 3, Sep. 2009.