A Schema and Instance Based RDF Dataset Summarization Tool

Jiaxuan Zou, Websoft, NJU, 2018/11/20

Outline
Background and Problem Definition. Methodology. System. Evaluation.

What is an RDF dataset
An RDF dataset can be regarded as a set of knowledge and RDF data: knowledge from ontologies, and instance RDF data about real-world things. An ontology is a vocabulary, a formal and explicit specification of a shared conceptualization, used to express and share knowledge.

Linking Open Data
295 datasets and over 31 billion RDF triples as of September 2011.

The problem
More and more datasets are being published, and they are large: a dataset often contains millions of RDF triples, sometimes even billions. This makes datasets hard for people to understand and use, so a dataset summarization tool is needed.

What is dataset summarization
Dataset summarization is the process of distilling the most important knowledge and instance data from a dataset to produce an abridged version for users and tasks. The definition is inspired by those of “text summarization” and “ontology summarization”. It can be recast as a process of presenting, ranking and selecting the contents of a dataset.

Process of RDF dataset summarization
[Diagram: the original contents of a dataset are presented as a term set, the term set is ranked into a term sequence, and terms are selected from the sequence to form the summarization.]

Content Presenting
A dataset can be presented by a set of basic terms. These terms should have a universal structure and contain both the schema information and the instance data of the dataset. The pair (RDF sentence, relevant instance RDF triples) is used as the basic term.

RDF sentence
An RDF sentence is a set of RDF triples that share a common blank node. A blank node stands for an existentially quantified resource. Taken together, those RDF triples carry a complete meaning. RDF sentences are used to represent schema information.

Generate RDF sentence
Although it may not be given explicitly, a dataset uses parts of one or more ontologies to construct its schema. The schema graph is generated via the “rdf:type” property. Within the schema graph, RDF sentences are generated by a DFS algorithm that is slightly modified to handle blank nodes. Recall the example, and imagine we start at S1……
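To make the grouping step concrete, here is a minimal Python sketch of a depth-first traversal that collects triples sharing blank nodes into one RDF sentence. It is an illustration only, not the tool's actual algorithm: the triple representation (plain tuples), the `_:` prefix used to recognize blank nodes, and the names `is_blank` and `rdf_sentences` are all assumptions, and a triple without any blank node is assumed to form a sentence on its own.

```python
from collections import defaultdict


def is_blank(node):
    # Assumption: blank nodes are written as strings starting with "_:".
    return isinstance(node, str) and node.startswith("_:")


def rdf_sentences(triples):
    """Group triples into sentences: triples connected through blank nodes
    end up in the same sentence; other triples stand alone."""
    # Index the triples by the blank nodes they mention.
    by_blank = defaultdict(list)
    for i, (s, _, o) in enumerate(triples):
        for node in (s, o):
            if is_blank(node):
                by_blank[node].append(i)

    seen = set()
    sentences = []
    for start in range(len(triples)):
        if start in seen:
            continue
        seen.add(start)
        stack = [start]  # explicit stack gives a depth-first traversal
        sentence = []
        while stack:
            i = stack.pop()
            sentence.append(triples[i])
            s, _, o = triples[i]
            for node in (s, o):
                if is_blank(node):
                    for j in by_blank[node]:
                        if j not in seen:
                            seen.add(j)
                            stack.append(j)
        sentences.append(sentence)
    return sentences


# Hypothetical schema fragment: three triples share the blank node _:b1.
triples = [
    ("ex:Record", "rdfs:subClassOf", "_:b1"),
    ("_:b1", "owl:onProperty", "ex:isMadeBy"),
    ("_:b1", "owl:someValuesFrom", "ex:MusicArtist"),
    ("ex:MusicArtist", "rdfs:subClassOf", "foaf:Agent"),
]
sentences = rdf_sentences(triples)
```

In this example the three triples that mention `_:b1` form one sentence and the remaining triple forms another.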

Relevant instance RDF triples
The RDF triples instantiated from an RDF sentence are called the relevant instance RDF triples of that sentence. Two cases are illustrated: a simple sentence and a complex sentence.

Ranking
Importance of a term’s RDF sentence, $I_{t.s}$: degree-based, PageRank-based, or HITS-based. Importance of the relevant instance RDF triples, $I_{t.Rs}$. Re-ranking based on information redundancy and coverage.

RDF sentence graph
If we regard each RDF sentence as a vertex in a directed graph, it has two kinds of links: sequential links and coordinate links.

RDF sentence graph (cont.)
Let $G = \langle V, E \rangle$ be the RDF sentence graph. For a sentence $s \in V$, let $N_s$ be the set of arcs starting from $s$ and $B_s$ the set of arcs pointing to $s$.

Degree-based Importance
For an RDF sentence $s \in V$, its in-degree is $\mathrm{In}(s) = |B_s|$ and its out-degree is $\mathrm{Out}(s) = |N_s|$. The degree-based importance is $I_{Degree}(s) = (\mathrm{In}(s) + \mathrm{Out}(s)) / C$.
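The degree-based score can be sketched in a few lines of Python. The slide does not say what C is, so the sketch takes it as a parameter; the example call uses twice the number of arcs so that the scores sum to 1, which is purely an assumption. The function name and the vertex labels are hypothetical.

```python
def degree_importance(vertices, arcs, c):
    """(In(s) + Out(s)) / c for every sentence s.

    vertices: iterable of sentence ids; arcs: list of (source, target) pairs.
    c: the constant C from the formula (its value is not given in the slides).
    """
    in_deg = {v: 0 for v in vertices}
    out_deg = {v: 0 for v in vertices}
    for src, dst in arcs:
        out_deg[src] += 1
        in_deg[dst] += 1
    return {v: (in_deg[v] + out_deg[v]) / c for v in vertices}


# Hypothetical sentence graph with three sentences and three arcs.
vertices = ["s1", "s2", "s3"]
arcs = [("s1", "s2"), ("s3", "s2"), ("s2", "s1")]
# Assumed choice of C: every arc contributes one in-degree and one out-degree.
scores = degree_importance(vertices, arcs, c=2 * len(arcs))
```

Here `s2` has two incoming arcs and one outgoing arc, so it gets the highest score of the three.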

PageRank-based Importance
Let θ be the convergence threshold. Initialization: $I_{PageRank}(s) = I_{Degree}(s)$. Iteration: $I_{PageRank}(s) = \sum_{v \in B_s} I_{PageRank}(v) / |N_v|$, repeated until convergence.
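The iteration above can be sketched as follows. The sketch follows the formula as stated, without a damping factor, so convergence is not guaranteed on every graph; the `max_iter` guard, the choice of C as twice the number of arcs for the initial degree scores, and the convergence test (largest per-vertex change below θ) are assumptions added for the example.

```python
def pagerank_importance(vertices, arcs, theta=1e-9, max_iter=1000):
    """Iterate rank(s) = sum(rank(v) / |N_v| for v pointing to s)."""
    out_nbrs = {v: [] for v in vertices}  # targets of arcs starting at v
    in_nbrs = {v: [] for v in vertices}   # sources of arcs pointing to v
    for src, dst in arcs:
        out_nbrs[src].append(dst)
        in_nbrs[dst].append(src)

    # Initial scores: degree-based importance, with C assumed to be 2 * |E|.
    c = 2 * len(arcs) or 1
    rank = {v: (len(in_nbrs[v]) + len(out_nbrs[v])) / c for v in vertices}

    for _ in range(max_iter):  # max_iter is a safety guard, not in the slides
        new_rank = {
            s: sum(rank[v] / len(out_nbrs[v]) for v in in_nbrs[s])
            for s in vertices
        }
        change = max(abs(new_rank[v] - rank[v]) for v in vertices)
        rank = new_rank
        if change < theta:
            break
    return rank


# Hypothetical strongly connected, aperiodic sentence graph.
vertices = ["s1", "s2", "s3"]
arcs = [("s1", "s2"), ("s2", "s3"), ("s3", "s1"), ("s1", "s3")]
ranks = pagerank_importance(vertices, arcs)
```

On this small graph the scores settle near 0.4 for `s1` and `s3` and 0.2 for `s2`. On graphs with vertices that have no outgoing arcs, or with strictly periodic structure, the undamped iteration may lose score mass or oscillate, which is why the iteration cap is there.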

HITS-based Importance
Let θ be the convergence threshold, and let $a(s)$ and $h(s)$ denote the authority and hub scores of $s$. Initialization: $a(s) = \mathrm{In}(s)$, $h(s) = \mathrm{Out}(s)$. Iteration: $a(s) = \sum_{v \in B_s} h(v)$ and $h(s) = \sum_{v \in N_s} a(v)$, repeated until convergence. Then $I_{HITS}(s) = a(s)$.
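A minimal Python sketch of the HITS iteration is given below. Two details are assumptions that do not appear in the slides: the scores are normalized to unit Euclidean length after every round (as in the standard HITS formulation, since otherwise they grow without bound), and the hub update uses the freshly computed authority scores. The `max_iter` guard and all names are likewise hypothetical.

```python
import math


def hits_importance(vertices, arcs, theta=1e-9, max_iter=1000):
    """Return the authority score a(s) of every sentence s."""
    out_nbrs = {v: [] for v in vertices}
    in_nbrs = {v: [] for v in vertices}
    for src, dst in arcs:
        out_nbrs[src].append(dst)
        in_nbrs[dst].append(src)

    # Initialization from the slides: a(s) = In(s), h(s) = Out(s).
    auth = {v: float(len(in_nbrs[v])) for v in vertices}
    hub = {v: float(len(out_nbrs[v])) for v in vertices}

    def normalize(scores):
        # Assumed step: scale to unit Euclidean norm to keep values bounded.
        norm = math.sqrt(sum(x * x for x in scores.values()))
        if norm == 0:
            return scores
        return {v: x / norm for v, x in scores.items()}

    for _ in range(max_iter):
        new_auth = normalize(
            {s: sum(hub[v] for v in in_nbrs[s]) for s in vertices}
        )
        new_hub = normalize(
            {s: sum(new_auth[v] for v in out_nbrs[s]) for s in vertices}
        )
        change = max(abs(new_auth[v] - auth[v]) for v in vertices)
        auth, hub = new_auth, new_hub
        if change < theta:
            break
    return auth


# Hypothetical sentence graph: s3 is pointed to by both other sentences.
vertices = ["s1", "s2", "s3"]
arcs = [("s1", "s2"), ("s1", "s3"), ("s2", "s3")]
authority = hits_importance(vertices, arcs)
```

In this example `s3` receives the highest authority score, `s2` a lower one, and `s1`, which nothing points to, gets zero.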

Relevant instance importance
Assumption: every relevant instance is equally important, because instances are numerous and sparsely connected. For a term $t$, $I_{t.Rs} = |t.Rs| / n$, where $n$ is the total number of relevant instances. Furthermore, if $t$’s RDF sentence contains more than one triple, $I_{t.Rs} = I_{t.Rs} / m$ with $m = |t.s|$. The importance of a term is therefore $I_t = \alpha I_{t.s} + (1 - \alpha) I_{t.Rs}$, with $\alpha \in [0, 1]$.
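The combination of the two scores can be written down directly. The sketch below is a plain transcription of the formulas on this slide; the function and parameter names are hypothetical, and the example numbers are made up for illustration.

```python
def term_importance(sentence_importance, num_relevant, total_instances,
                    sentence_size, alpha=0.5):
    """alpha * I_t.s + (1 - alpha) * I_t.Rs for one term.

    sentence_importance: I_t.s, from any of the graph-based rankings.
    num_relevant: |t.Rs|, the number of relevant instance triples.
    total_instances: n, the total count used for normalization.
    sentence_size: m = |t.s|, the number of triples in the RDF sentence.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    instance_importance = num_relevant / total_instances
    if sentence_size > 1:
        # Sentences with several triples have their instance score divided by m.
        instance_importance /= sentence_size
    return alpha * sentence_importance + (1 - alpha) * instance_importance


# Made-up numbers: I_t.s = 0.5, 250 of 1000 instances, a 2-triple sentence.
score = term_importance(0.5, 250, 1000, sentence_size=2, alpha=0.5)
```

With α = 1 the ranking depends only on the schema-level sentence score, and with α = 0 only on how many instances the sentence covers.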

Coverage-based re-rank
Considering importance alone may lead to information redundancy. Two RDF sentences may be inverses of each other, for example (Record, isMadeBy, MusicArtist) vs (MusicArtist, made, Record), or the result may focus too much on a few entities. The re-rank based on coverage therefore applies two kinds of penalties: a very large penalty when the inverse situation occurs, and a medium penalty for each time the subject of a term’s RDF sentence has already occurred in the result sequence.
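One way to realize such a re-rank is a greedy selection that recomputes the penalized score of every remaining term before each pick. The sketch below is an assumption-heavy illustration rather than the tool's method: each term is reduced to a single (subject, predicate, object, importance) tuple, two terms count as inverses when their subject and object are swapped, the penalties are subtracted from the importance, and the penalty values 1.0 and 0.1 are hypothetical.

```python
def penalized_score(term, result, inverse_penalty, subject_penalty):
    subject, _, obj, score = term
    # Very large penalty if an already selected term is the inverse of this one.
    if any(rs == obj and ro == subject for rs, _, ro, _ in result):
        score -= inverse_penalty
    # Medium penalty for every earlier occurrence of the same subject.
    repeats = sum(1 for rs, _, _, _ in result if rs == subject)
    return score - subject_penalty * repeats


def rerank(terms, inverse_penalty=1.0, subject_penalty=0.1):
    """Greedily build the result sequence from the best penalized term."""
    remaining = list(terms)
    result = []
    while remaining:
        best = max(
            remaining,
            key=lambda t: penalized_score(
                t, result, inverse_penalty, subject_penalty
            ),
        )
        remaining.remove(best)
        result.append(best)
    return result


# Hypothetical terms with made-up importance scores.
terms = [
    ("Record", "isMadeBy", "MusicArtist", 0.9),
    ("MusicArtist", "made", "Record", 0.8),
    ("Record", "hasTrack", "Track", 0.7),
    ("Track", "availableAs", "Playlist", 0.65),
]
order = [predicate for _, predicate, _, _ in rerank(terms)]
```

In this example the inverse term `made` drops from second place to last, and `hasTrack` falls behind `availableAs` because its subject `Record` is already in the result.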

Selecting
How many terms should make up a summarization is an open question, so users decide among three scales, where S is the RDF sentence set. Simple: 10 terms, or |S| if |S| < 10. Medium: 20% * |S| terms, or 10 if 20% * |S| < 10, or |S| if |S| < 10. Detailed: 50% * |S| terms, or 10 if 50% * |S| < 10, or |S| if |S| < 10.
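The three scales can be expressed as a small function. Parts of the slide text are missing, so the sketch rests on assumptions: the simple scale is taken to be 10 terms, the medium scale 20% of |S| (as the condition "20% * |S| < 10" suggests), and fractional sizes are rounded up. The function name and the level labels are hypothetical.

```python
def summary_size(num_sentences, level):
    """Number of terms to select for a sentence set of size |S|."""
    percent_by_level = {"medium": 20, "detailed": 50}
    if level != "simple" and level not in percent_by_level:
        raise ValueError(f"unknown level: {level}")
    if num_sentences < 10:
        return num_sentences  # fewer than 10 sentences: show them all
    if level == "simple":
        return 10  # assumed fixed size for the simple scale
    percent = percent_by_level[level]
    # Integer ceiling of percent% of |S| (rounding up is an assumption).
    share = (percent * num_sentences + 99) // 100
    return max(10, share)


sizes = {
    level: summary_size(200, level)
    for level in ("simple", "medium", "detailed")
}
```

For a dataset with 200 RDF sentences this gives 10, 40 and 100 terms for the three scales, and for very small datasets every scale shows all sentences.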

System Architecture

The WinRAR package
The system is distributed as a WinRAR package. Users can download the package and execute the .jar file. The “model” folder contains data models, which are generated the first time each dataset is summarized and speed up later summarizations.

The initial UI
[Screenshot: the initial UI]

The file choosing UI
[Screenshot: the file chooser shown after pressing the “open” button]

Cold boot time
[Plot: the X axis is the size of the dataset in k-triples; the Y axis is the time needed in seconds.]

Main summary UI
“Jamendo” is a music dataset mainly based in France.

Main summary UI (cont.)
After changing α and the scale.

Main summary UI (cont.)
After choosing HITS.

Term’s detail UI
When choosing a term for further exploration.

Evaluation
Evaluating a summary is hard, since no universal standard can be built. The tool is evaluated on two aspects: performance and functionality.

Performance evaluation
A summary of the Jamendo dataset with 10 terms is created by hand to serve as a so-called “gold standard”, and is used to evaluate the performance of the summarization results. Performance here mainly means accuracy, which can be evaluated in two ways: order-independent and order-dependent.

Order-independent
$P(t) = K \cdot \theta$, where $\theta = 1$ if the term $t$ occurs and $\theta = 0$ otherwise.

Order-dependent
$P(t) = K \cdot \theta \cdot (1 - \Delta)$, where $\Delta = |r_t - o_t| / 10$.
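Both accuracy measures can be sketched in one function. The slides leave several symbols undefined, so the following readings are assumptions: K is a per-term weight of 1 divided by the size of the gold standard, θ indicates whether a gold-standard term occurs in the result, r_t and o_t are the positions of a term in the result and in the gold standard, and the overall accuracy is the sum of P(t) over the gold-standard terms. The term labels are placeholders.

```python
def accuracy(result, gold, order_dependent=False):
    """Sum of P(t) over the gold-standard terms.

    result, gold: ordered lists of term ids (10 terms each in the slides).
    """
    k = 1 / len(gold)  # assumed weight K: every gold term counts equally
    total = 0.0
    for o_t, term in enumerate(gold):
        if term not in result:
            continue  # theta = 0: the term does not occur in the result
        p = k  # theta = 1
        if order_dependent:
            r_t = result.index(term)
            delta = abs(r_t - o_t) / 10  # divisor 10 as in the slide
            p *= 1 - delta
        total += p
    return total


# Placeholder gold standard and a result with the first two terms swapped.
gold = list("abcdefghij")
result = list("bacdefghij")
unordered = accuracy(result, gold)
ordered = accuracy(result, gold, order_dependent=True)
```

Swapping two neighbouring terms leaves the order-independent accuracy unchanged, while the order-dependent accuracy drops slightly because each of the two displaced terms loses a tenth of its weight.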

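Functionality evaluation
Functionality is evaluated through user feedback. An experimental dataset with 6000 RDF triples is built from LinkedMDB. In the first stage, users explore only the result summary and rate each statement on a five-point scale (strongly disagree, somewhat disagree, neutral, somewhat agree, strongly agree). Statements: (1) The summary is presented concisely and clearly, and is easy to understand. (2) Based on the summary, I can quickly understand the main content of the dataset. (3) I could develop simple applications for the dataset based on the summary alone, without browsing the whole dataset.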

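Functionality evaluation (cont.)
After exploring the whole experimental dataset, users rate two more statements on the same five-point scale (strongly disagree, somewhat disagree, neutral, somewhat agree, strongly agree). Statements: (1) The summary correctly reflects the main content of the dataset. (2) Considering all aspects, this is an excellent summary.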

Thanks!
Your questions are appreciated!
