A Schema and Instance Based RDF Dataset Summarization Tool

A Schema and Instance Based RDF Dataset Summarization Tool Jiaxuan Zou Websoft NJU 2018/11/20

Outline: Background and Problem Definition; Methodology; System; Evaluation

What is an RDF dataset: An RDF dataset can be regarded as a combination of knowledge and RDF data: knowledge from ontologies, and instance RDF data about real-world things. An ontology is a vocabulary, that is, a formal, explicit specification of a shared conceptualization, used to express and share knowledge.

Linking Open Data: 295 datasets and over 31 billion RDF triples by September 2011.

The problem: More and more datasets are being published, and they are large. A dataset often contains millions of RDF triples, sometimes even billions, which makes it hard for people to understand and use. A dataset summarization tool is therefore needed.

What is dataset summarization: The process of distilling the most important knowledge and instance data from a dataset to produce an abridged version for users and tasks. The definition is inspired by those of “text summarization” and “ontology summarization”. It can be recast as a process of presenting, ranking, and selecting the contents of a dataset.

Process of RDF dataset summarization: the original contents of a dataset are presented as a term set, the term set is ranked into a term sequence, and terms are selected from the sequence to form the summarization.

Content Presenting: A dataset can be presented as a set of basic terms. These terms should have a universal structure and contain both schema information and instance data of the dataset. The basic term is the pair (RDF sentence, relevant instance RDF triples).
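
As a rough illustration of the basic term described above, the pair can be modelled as a small record. This is only a sketch: the class name `Term`, its field names, and the `ex:` identifiers are hypothetical and do not come from the tool itself.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# A triple is modelled as (subject, predicate, object) strings (an assumption).
Triple = Tuple[str, str, str]


@dataclass
class Term:
    """Hypothetical container for one basic term of a dataset."""
    sentence: List[Triple]                                         # schema-level RDF sentence
    relevant_triples: List[Triple] = field(default_factory=list)   # its instance-level triples


term = Term(sentence=[("ex:Record", "ex:isMadeBy", "ex:MusicArtist")])
term.relevant_triples.append(("ex:record1", "ex:isMadeBy", "ex:artist7"))
```

Keeping the sentence and its instance triples in one record lets the later ranking steps score both parts of the same term.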

RDF sentence: An RDF sentence is a set of RDF triples that share a common blank node. A blank node stands for an existentially quantified resource, so those triples together form one complete semantic unit. RDF sentences are used to express schema information.

Generate RDF sentence: Although it may not be given explicitly, a dataset uses parts of one or more ontologies to construct its schema. The schema graph is generated via the “rdf:type” property. In the schema graph, RDF sentences are generated by a DFS algorithm that is slightly modified to handle blank nodes. Recall the example and imagine we start at S1…
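
The grouping step can be sketched as a depth-first traversal that keeps together all triples reachable through shared blank nodes. This is a minimal sketch under assumptions: triples are plain string tuples, blank nodes are recognised by a `_:` prefix, and the function names are hypothetical. It is not the tool's actual algorithm.

```python
from collections import defaultdict


def is_blank(node):
    # Assumption: blank nodes are written with the usual "_:" prefix.
    return node.startswith("_:")


def rdf_sentences(triples):
    """Group triples so that triples linked through blank nodes stay together."""
    by_blank = defaultdict(list)  # blank node -> indices of triples mentioning it
    for i, (s, _, o) in enumerate(triples):
        for node in (s, o):
            if is_blank(node):
                by_blank[node].append(i)

    seen = set()
    sentences = []
    for start in range(len(triples)):
        if start in seen:
            continue
        seen.add(start)
        stack = [start]          # explicit stack: iterative DFS
        sentence = []
        while stack:
            i = stack.pop()
            sentence.append(triples[i])
            s, _, o = triples[i]
            for node in (s, o):
                if is_blank(node):
                    for j in by_blank[node]:
                        if j not in seen:
                            seen.add(j)
                            stack.append(j)
        sentences.append(sorted(sentence))
    return sentences


example = [
    ("ex:Record", "ex:isMadeBy", "ex:MusicArtist"),
    ("ex:Track", "rdfs:subClassOf", "_:b1"),
    ("_:b1", "owl:onProperty", "ex:duration"),
    ("_:b1", "owl:minCardinality", "1"),
]
sentences = rdf_sentences(example)
```

In this example the first triple has no blank node and forms a sentence on its own, while the three triples around `_:b1` end up in one sentence.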

Relevant instance RDF triples: The RDF triples instantiated from an RDF sentence are called the relevant instance RDF triples of that sentence. Examples: a simple sentence; a complex sentence.
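
For a simple, single-triple sentence, one plausible way to find its relevant instance triples is to match the property and check the `rdf:type` of both ends. The sketch below assumes exactly that reading; the function name, the `types` mapping, and the sample identifiers are hypothetical.

```python
def relevant_instance_triples(schema_triple, instance_triples, types):
    """Return instance triples that instantiate a (class, property, class) sentence.

    `types` maps each resource to the set of classes it is typed with
    (an assumed, precomputed view of the rdf:type statements).
    """
    subject_class, prop, object_class = schema_triple
    return [
        (s, p, o)
        for (s, p, o) in instance_triples
        if p == prop
        and subject_class in types.get(s, set())
        and object_class in types.get(o, set())
    ]


types = {
    "ex:record1": {"ex:Record"},
    "ex:artist7": {"ex:MusicArtist"},
    "ex:track3": {"ex:Track"},
}
instances = [
    ("ex:record1", "ex:isMadeBy", "ex:artist7"),
    ("ex:track3", "ex:isMadeBy", "ex:artist7"),
]
matches = relevant_instance_triples(
    ("ex:Record", "ex:isMadeBy", "ex:MusicArtist"), instances, types
)
```

A complex sentence with blank nodes would need a pattern match over several triples, which this sketch does not attempt.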

Ranking: The importance of a term's RDF sentence, I(t.s), can be degree-based, PageRank-based, or HITS-based. The importance of its relevant instance RDF triples is I(t.Rs). A re-rank based on information redundancy and coverage follows.

RDF sentence graph: If we regard an RDF sentence as a vertex in a directed graph, it has two kinds of links: sequential links and coordinate links.

RDF sentence graph (cont.): Let G = <V, E> be the RDF sentence graph. For a sentence s ∈ V, let Ns be the set of arcs starting from s and Bs the set of arcs pointing to s.

Degree-based Importance: For an RDF sentence s ∈ V, its in-degree is In(s) = |Bs| and its out-degree is Out(s) = |Ns|. I_Degree(s) = (In(s) + Out(s)) / C.
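
The degree-based score is simple enough to sketch directly. The slide does not say what C is, so the sketch below assumes C is the sum of all degrees (twice the number of arcs), which makes the scores sum to 1. The function name and that default are assumptions.

```python
def degree_importance(vertices, arcs, c=None):
    """Degree-based importance: (in-degree + out-degree) / C for each vertex."""
    in_deg = {v: 0 for v in vertices}
    out_deg = {v: 0 for v in vertices}
    for src, dst in arcs:
        out_deg[src] += 1
        in_deg[dst] += 1
    if c is None:
        # Assumption: normalise by the total degree so that scores sum to 1.
        c = max(1, 2 * len(arcs))
    return {v: (in_deg[v] + out_deg[v]) / c for v in vertices}


degree_scores = degree_importance(
    ["s1", "s2", "s3"],
    [("s1", "s2"), ("s3", "s2"), ("s2", "s1")],
)
```

Here `s2` has two incoming arcs and one outgoing arc, so it receives 3/6 of the total score.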

PageRank-based Importance: Let θ be the convergence threshold. Initialization: I_PageRank(s) = I_Degree(s). Iteration: I_PageRank(s) = Σ (I_PageRank(v) / |Nv|) over all v ∈ Bs, repeated until convergence.
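
The iteration on this slide can be sketched as follows. The update rule follows the slide (no damping factor); the iteration cap `max_iter` is an added safeguard, since the undamped iteration is not guaranteed to converge on every graph, and the function and parameter names are hypothetical.

```python
def pagerank_importance(vertices, arcs, initial, theta=1e-6, max_iter=100):
    """Iterate I(s) = sum(I(v) / |N_v| for v in B_s) until the change is below theta."""
    out_deg = {v: 0 for v in vertices}
    preds = {v: [] for v in vertices}      # B_s: sources of arcs pointing to s
    for src, dst in arcs:
        out_deg[src] += 1
        preds[dst].append(src)

    score = dict(initial)
    for _ in range(max_iter):              # assumption: cap the number of rounds
        new = {s: sum(score[v] / out_deg[v] for v in preds[s]) for s in vertices}
        delta = max(abs(new[s] - score[s]) for s in vertices)
        score = new
        if delta < theta:
            break
    return score


vertices = ["s1", "s2", "s3"]
arcs = [("s1", "s2"), ("s1", "s3"), ("s2", "s3"), ("s3", "s1")]
pagerank_scores = pagerank_importance(vertices, arcs, {v: 1 / 3 for v in vertices})
```

On this small graph the scores settle near 0.4, 0.2, and 0.4 for `s1`, `s2`, and `s3`.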

HITS-based Importance: Let θ be the convergence threshold, and let a(s) and h(s) denote the authority and hub scores of s. Initialization: a(s) = In(s), h(s) = Out(s). Iteration: a(s) = Σ h(v) over all v ∈ Bs; h(s) = Σ a(v) over all v ∈ Ns; repeated until convergence. I_HITS(s) = a(s).
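
A sketch of the HITS variant is given below. The initialisation and update rules follow the slide. Two details are assumptions: the scores are L2-normalised after every round (without some normalisation the raw sums keep growing and never meet the threshold), and the hub update uses the freshly computed authority scores. The function name is hypothetical.

```python
import math


def hits_importance(vertices, arcs, theta=1e-6, max_iter=100):
    """Return the authority score a(s) of every vertex after HITS iteration."""
    preds = {v: [] for v in vertices}   # B_s
    succs = {v: [] for v in vertices}   # N_s
    for src, dst in arcs:
        succs[src].append(dst)
        preds[dst].append(src)

    auth = {v: float(len(preds[v])) for v in vertices}   # a(s) = In(s)
    hub = {v: float(len(succs[v])) for v in vertices}    # h(s) = Out(s)
    for _ in range(max_iter):
        new_auth = {s: sum(hub[v] for v in preds[s]) for s in vertices}
        new_hub = {s: sum(new_auth[v] for v in succs[s]) for s in vertices}
        # Assumption: L2-normalise both score vectors each round.
        a_norm = math.sqrt(sum(x * x for x in new_auth.values())) or 1.0
        h_norm = math.sqrt(sum(x * x for x in new_hub.values())) or 1.0
        new_auth = {s: x / a_norm for s, x in new_auth.items()}
        new_hub = {s: x / h_norm for s, x in new_hub.items()}
        delta = max(abs(new_auth[s] - auth[s]) for s in vertices)
        auth, hub = new_auth, new_hub
        if delta < theta:
            break
    return auth


hits_scores = hits_importance(["s1", "s2", "s3"], [("s1", "s3"), ("s2", "s3")])
```

In the example only `s3` is pointed to, so it carries all of the authority.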

Relevant instance importance: Assumption: every relevant instance is equally important (instances are numerous and sparsely connected). For a term t, I(t.Rs) = |t.Rs| / n, where n is the total number. Furthermore, if t's RDF sentence contains more than one triple, I(t.Rs) is divided by m = |t.s|. The importance of a term is therefore I(t) = α·I(t.s) + (1 − α)·I(t.Rs), with α ∈ [0, 1].
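
The two formulas on this slide combine as in the sketch below. The function names are hypothetical, and `total` is assumed to stand for the slide's n, which the slide only describes as "the total number".

```python
def instance_importance(num_relevant, total, sentence_size):
    """I(t.Rs) = |t.Rs| / n, divided by m = |t.s| when the sentence has several triples."""
    score = num_relevant / total
    if sentence_size > 1:
        score /= sentence_size
    return score


def term_importance(sentence_score, instance_score, alpha=0.5):
    """I(t) = alpha * I(t.s) + (1 - alpha) * I(t.Rs), with alpha in [0, 1]."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * sentence_score + (1 - alpha) * instance_score


# A one-triple sentence with 30 relevant instance triples out of 100.
combined = term_importance(0.5, instance_importance(30, 100, 1), alpha=0.5)
```

Setting α close to 1 favours schema-level importance, while a value close to 0 favours terms with many instance triples.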

Coverage-based re-rank: Considering importance alone may lead to information redundancy. Two RDF sentences may be inverses of each other, e.g. (Record, isMadeBy, MusicArtist) vs. (MusicArtist, made, Record), or the summary may over-focus on a few entities. The re-rank based on coverage applies two main kinds of penalty: a very large penalty when the inverse situation occurs, and a medium penalty for each time the subject of a term's RDF sentence has already occurred in the result sequence.
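
One way to realise the two penalties is a greedy re-rank that repeatedly picks the term with the best penalised score. Everything concrete in the sketch below is an assumption: the penalty values, the restriction to single-triple sentences, the `inverse_of` mapping used to detect inverse properties, and the function name.

```python
def rerank(terms, inverse_of, inverse_penalty=1.0, subject_penalty=0.1):
    """Greedy coverage-based re-rank.

    `terms` is a list of (score, (subject, predicate, object)) pairs;
    `inverse_of` maps a property to its inverse property (both directions).
    """
    remaining = list(terms)
    result = []
    while remaining:
        def adjusted(term):
            score, (s, p, o) = term
            for _, (s2, p2, o2) in result:
                if s2 == o and o2 == s and inverse_of.get(p) == p2:
                    score -= inverse_penalty      # very large penalty: inverse sentence
                if s2 == s:
                    score -= subject_penalty      # medium penalty: subject seen again
            return score

        best = max(remaining, key=adjusted)
        remaining.remove(best)
        result.append(best)
    return [triple for _, triple in result]


ranked = rerank(
    [
        (0.9, ("Record", "isMadeBy", "MusicArtist")),
        (0.8, ("MusicArtist", "made", "Record")),
        (0.7, ("Record", "hasTrack", "Track")),
        (0.5, ("Track", "title", "Literal")),
    ],
    inverse_of={"isMadeBy": "made", "made": "isMadeBy"},
)
```

In this example the inverse sentence (MusicArtist, made, Record) drops from second place to last, while the repeated subject Record costs the third term only a small amount.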

Selecting: How many terms a summarization should comprise is an open question, so users decide. Simple: 10 terms, or |S| if |S| < 10, where S is the RDF sentence set. Medium: 20% × |S| terms, or 10 if 20% × |S| < 10, or |S| if |S| < 10. Detailed: 50% × |S| terms, or 10 if 50% × |S| < 10, or |S| if |S| < 10.
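
The three scales translate into a small size function. The sketch below follows the rules on the slide; rounding fractional sizes up is an assumption, as are the function name and the level labels used as keys.

```python
import math

# Percentage of |S| used by each proportional scale.
PERCENT = {"medium": 20, "detailed": 50}


def summary_size(num_sentences, level):
    """Number of terms in the summary for a given scale and sentence-set size |S|."""
    if num_sentences < 10:
        return num_sentences              # fewer than 10 sentences: take them all
    if level == "simple":
        return 10
    # Assumption: round fractional sizes up; never go below 10 terms.
    return max(10, math.ceil(PERCENT[level] * num_sentences / 100))
```

For example, a dataset with 200 RDF sentences yields 10, 40, and 100 terms for the simple, medium, and detailed scales.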

System Architecture

The WinRAR package: The system is distributed as a WinRAR package. Users download the package and execute the .jar file. The “model” folder contains data models, which are generated the first time each dataset is summarized and speed up later summarizations.

The initial UI

The file choosing UI: shown after pressing the “open” button.

Cold boot time: The X axis is the dataset size in thousands of triples (k-triples). The Y axis is the time needed, in seconds.

Main summary UI: “Jamendo” is a music dataset mainly based in France.

Main summary UI (cont.): after changing α and the scale.

Main summary UI (cont.): after choosing HITS.

Term's detail UI: shown when a term is chosen for further exploration.

Evaluation: Evaluating a summary is hard because no universal standard can be built. The tool is evaluated on two aspects: performance and functionality.

Performance evaluation: A 10-term summary of the Jamendo dataset is created by hand as a so-called “gold standard” and used to evaluate the summarization results. Performance here mainly means accuracy, which can be evaluated in two ways: order-independent and order-dependent.

Order-independent: P(t) = K·θ, where θ = 1 if the term t occurs and θ = 0 otherwise.

Order-dependent: P(t) = K·θ·(1 − Δ), where Δ = (r_t − o_t) / 10.
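
Both accuracy measures can be sketched as below. Several readings are assumptions, not statements from the slides: θ is taken to mean "the term occurs in the gold standard", r_t and o_t are taken to be the 1-based ranks of t in the result and in the gold standard, the rank difference is taken in absolute value so the penalty is never negative, and the per-term scores P(t) are summed into one number. The function names are hypothetical.

```python
def order_independent(result, gold, k=1.0):
    """Sum of P(t) = K * theta, with theta = 1 when t is in the gold standard."""
    gold_set = set(gold)
    return sum(k for t in result if t in gold_set)


def order_dependent(result, gold, k=1.0):
    """Sum of P(t) = K * theta * (1 - delta), delta = |r_t - o_t| / 10 (assumed reading)."""
    gold_rank = {t: i for i, t in enumerate(gold, start=1)}
    total = 0.0
    for r, t in enumerate(result, start=1):
        if t in gold_rank:                       # theta = 1
            delta = abs(r - gold_rank[t]) / 10   # rank distance, scaled by summary size 10
            total += k * (1 - delta)
    return total


gold = ["a", "b", "c"]
result = ["b", "a", "x"]
hits = order_independent(result, gold)
ordered = order_dependent(result, gold)
```

With these inputs two of three terms are found, and each is one position away from its gold rank, so the order-dependent score is slightly lower than the order-independent one.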

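Functionality evaluation: Evaluated through user feedback. An experimental dataset of 6,000 RDF triples is built from LinkedMDB. Users explore only the resulting summary, then rate each statement on a five-point scale (strongly disagree, somewhat disagree, neutral, somewhat agree, strongly agree):
The summary is presented concisely and clearly, and is easy to understand.
From the summary, I can quickly understand the main content of the dataset.
I can develop simple applications for the dataset from the summary alone, without browsing the whole dataset.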

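Functionality evaluation (cont.): After exploring the whole experimental dataset, users rate two more statements on the same five-point scale (strongly disagree, somewhat disagree, neutral, somewhat agree, strongly agree):
The summary correctly reflects the main content of the dataset.
Taking all aspects into account, this is an excellent summary.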

Thanks! Your questions are appreciated!