Linked data: Predicting missing properties Klemen Simonic, Jan Rupnik, Primoz Skraba {klemen.simonic, jan.rupnik,



Overview
1. Linked Data (motivation for the work)
2. Problem Definition
3. Approaches
4. Results

An example

Linked Data
- Connects related data that was not previously linked
- A practice for exposing, sharing, and connecting pieces of data and information
How:
- URI (Uniform Resource Identifier)
- RDF (Resource Description Framework): a description of how to model and present the data

Linked Data, tiny example

Resource | Predicate / Property | Resource / Literal

Linked Data, one dataset
- Nodes are resources
- Edges are relations
- Edge labels are properties
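The node/edge/label view above can be sketched in a few lines of Python. This is only an illustration: the two triples reuse the University of Ljubljana facts from the DBpedia slide, while the function name `build_graph` and the shortened resource names are assumptions made for the example.

```python
from collections import defaultdict

# Example triples (subject, property, object). The facts come from the
# DBpedia slide; the shortened resource names are an assumption.
triples = [
    ("University_of_Ljubljana", "Location", "Ljubljana"),
    ("University_of_Ljubljana", "Established", "1919"),
]


def build_graph(triples):
    """Return {subject: [(property, object), ...]}, i.e. labeled out-edges."""
    out_edges = defaultdict(list)
    for subject, prop, obj in triples:
        # The subject is a node, the property is the edge label,
        # and the object is either another node or a literal.
        out_edges[subject].append((prop, obj))
    return dict(out_edges)


graph = build_graph(triples)
```

Here `graph["University_of_Ljubljana"]` holds one labeled edge per fact about that resource.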

Linked Data cloud diagram

DBpedia
DBpedia extracts information from the infoboxes of Wikipedia articles.
Resource | Property | Literal
en.wikipedia.org/wiki/University_of_Ljubljana | Location | http://en.wikipedia.org/wiki/Ljubljana
en.wikipedia.org/wiki/University_of_Ljubljana | Established | “1919”

DBpedia
- DBraw: contains all the properties from all the infoboxes in English Wikipedia articles.
- DBmapped: the properties are unified (mapped onto the DBpedia ontology), so properties with the same semantics coincide, e.g. PlaceOfBirth = BirthPlace. The data is much cleaner and better structured than the raw properties dataset.

Freebase
An entity graph of people, places and things, built by people.
- Collaborative knowledge base
- Property schemas
- Google Knowledge Graph

Scale of Datasets
Dataset | #nodes | #edges | #objects | #properties | avgDeg
DBmapped | 5M | 17M | 2M | |
DBraw | 11M | 47M | 3M | |
Freebase | 141M | 607M | 23M | |
- DBmapped and DBraw: DBpedia version 3.7 (additional properties and resources may have been added since).
- Freebase: the largest and most structured dataset (large number of edges and objects, and a relatively small number of properties).
- DBraw: a messy and noisy dataset (large number of different properties because they are not unified).

Missing properties
Problem: What are the missing properties for Fiat?
For a given resource, we want a ranking of its missing properties by likelihood.

Approach
- Similar objects
- Measure of similarity
- Neighborhood
- Ranking function

Approach
Ranking = weighted average of the property frequency vectors of the k nearest-neighbor objects.
General framework (kernel smoother): we can replace d with a normalized kernel function. (More math on this topic is in the paper.)
The function g(o) depends on the choice of the measure of closeness d(o, o_i).
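A minimal sketch of this ranking step, under stated assumptions: the slide does not give the kernel, so a Gaussian kernel with a hypothetical `bandwidth` parameter is used here, and the function name, the neighbor data, and the property names are all illustrative rather than taken from the paper.

```python
import math


def rank_missing_properties(query_props, neighbours, k=2, bandwidth=1.0):
    """Rank properties the query object lacks.

    neighbours: list of (distance, {property: count}) pairs, where distance
    plays the role of d(o, o_i). The Gaussian kernel is an assumption.
    """
    # Keep only the k nearest objects.
    nearest = sorted(neighbours, key=lambda pair: pair[0])[:k]
    weights = [math.exp(-(d / bandwidth) ** 2) for d, _ in nearest]
    total = sum(weights)
    if total == 0:
        return []
    # Weighted average of the neighbours' property frequency vectors.
    scores = {}
    for w, (_, props) in zip(weights, nearest):
        for prop, count in props.items():
            scores[prop] = scores.get(prop, 0.0) + (w / total) * count
    # Only properties the query object does not have yet are candidates.
    candidates = [p for p in scores if p not in query_props]
    return sorted(candidates, key=lambda p: (-scores[p], p))


# Hypothetical neighbourhood of a query object that already has "Location".
neighbours = [
    (0.5, {"Location": 1, "Founder": 1}),
    (1.0, {"Location": 1, "Industry": 1}),
    (3.0, {"Mascot": 1}),
]
ranking = rank_missing_properties({"Location"}, neighbours, k=2)
```

With `k=2` the third object is ignored, and the closer neighbour's extra property is ranked first.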

Evaluation protocol
The evaluation procedure:
1. For a given object o, we delete one or more of its properties, denoted (o, {p_1, …, p_k}).
2. Run the recommendation algorithm for the object.
3. Compute several evaluation metrics.

Evaluation metrics -Inverse rank (IRank) = -Top 5 = -Top 10 =
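The formulas for these metrics did not survive in the transcript. The sketch below therefore uses commonly used definitions as an assumption: the inverse rank is 1 divided by the position of the deleted property in the returned ranking, and Top N is 1 when the deleted property appears among the first N results. The function names and the toy trials are hypothetical.

```python
def inverse_rank(ranking, held_out):
    """1 / (1-based position of held_out in ranking), or 0.0 if it is absent."""
    if held_out not in ranking:
        return 0.0
    return 1.0 / (ranking.index(held_out) + 1)


def top_n_hit(ranking, held_out, n):
    """1.0 if held_out is among the first n ranked properties, else 0.0."""
    return 1.0 if held_out in ranking[:n] else 0.0


def evaluate(trials):
    """Average the metrics over (ranking, held_out) pairs."""
    count = len(trials)
    return {
        "IRank": sum(inverse_rank(r, h) for r, h in trials) / count,
        "Top5": sum(top_n_hit(r, h, 5) for r, h in trials) / count,
        "Top10": sum(top_n_hit(r, h, 10) for r, h in trials) / count,
    }


# Two toy trials: the deleted property is found at rank 2 and at rank 7.
trials = [
    (["a", "b", "c"], "b"),
    (["x", "y", "z", "p", "q", "r", "s"], "s"),
]
metrics = evaluate(trials)
```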

Measure of Closeness
- Local Measures: local graph properties
  - Baselines:
    - Random Objects
    - Objects with Common Properties
    - Property Co-occurrence
- Global Measures: global graph properties
- Exogenous Measures: external information (text)

Local Graph Measures
We focus on a local description, based on the property distributions:
- PropertyCount
- DirPropertyCount
- NeighbDirPropertyCount

Random objects Choose uniformly at random some number of objects in the network

Objects with common properties
Take the objects that share at least a minimum number of properties with the query object. The number of shared properties is used as the weight for the object.
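This baseline is simple enough to sketch directly. The function name, the `min_shared` threshold value, and the example objects and property names are assumptions for illustration (Fiat is only the query example from the problem slide).

```python
def common_property_neighbours(query_props, objects, min_shared=2):
    """Return [(object, weight)] for objects sharing enough properties.

    objects: {name: set of property names}. The weight is the number of
    properties shared with the query object.
    """
    weighted = []
    for name, props in objects.items():
        shared = len(query_props & props)
        if shared >= min_shared:
            weighted.append((name, shared))
    # Highest weight first; ties broken by name for a stable order.
    return sorted(weighted, key=lambda item: (-item[1], item[0]))


# Hypothetical property sets.
fiat_props = {"Location", "Founder", "Industry"}
objects = {
    "Ferrari": {"Location", "Founder", "Industry", "Products"},
    "Ljubljana": {"Location", "Population"},
    "Renault": {"Founder", "Industry"},
}
neighbours = common_property_neighbours(fiat_props, objects, min_shared=2)
```

The resulting weights can be plugged into the weighted average in place of kernel values.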

Property Co-occurrence
Approximate resource similarities through property co-occurrence patterns. Only pairwise co-occurrences are considered, for scalability and feasibility of estimation.
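The slide does not state how the pairwise counts are turned into a score, so the following is one possible reading, not the authors' formula: a candidate property q is scored by summing the estimated conditional frequency count(p, q) / count(p) over the properties p the query object already has. All names and data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(objects):
    """Count single properties and unordered property pairs over all objects."""
    single = Counter()
    pair = Counter()
    for props in objects.values():
        single.update(props)
        for a, b in combinations(sorted(props), 2):
            pair[(a, b)] += 1  # keys are stored in sorted order
    return single, pair


def score_candidates(query_props, single, pair):
    """Rank properties missing from query_props by summed count(p, q) / count(p)."""
    scores = {}
    for q in single:
        if q in query_props:
            continue
        total = 0.0
        for p in query_props:
            if single[p] == 0:
                continue
            key = (p, q) if p < q else (q, p)
            total += pair[key] / single[p]
        scores[q] = total
    return sorted(scores, key=lambda q: (-scores[q], q))


objects = {
    "A": {"Location", "Founder", "Industry"},
    "B": {"Location", "Founder"},
    "C": {"Location", "Population"},
}
single, pair = cooccurrence_counts(objects)
ranking = score_candidates({"Location"}, single, pair)
```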

Our method
Each object is described by its DirPropertyCount vector. The similarity between two objects is the dot product of their DirPropertyCount vectors.
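A small sketch of this similarity, assuming that a DirPropertyCount vector counts each property separately for incoming and outgoing edges (the slides do not define it precisely). The sparse representation, the function names, and the triples are illustrative assumptions.

```python
from collections import Counter


def dir_property_count(obj, triples):
    """Sparse vector: how often each (direction, property) touches obj."""
    vec = Counter()
    for subject, prop, target in triples:
        if subject == obj:
            vec[("out", prop)] += 1
        if target == obj:
            vec[("in", prop)] += 1
    return vec


def dot(u, v):
    """Dot product of two sparse vectors stored as dictionaries."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(count * v[key] for key, count in u.items() if key in v)


# Hypothetical triples.
triples = [
    ("Fiat", "locationCity", "Turin"),
    ("Fiat", "industry", "Automotive"),
    ("Ferrari", "locationCity", "Maranello"),
    ("Ferrari", "industry", "Automotive"),
    ("Panda", "manufacturer", "Fiat"),
    ("F40", "manufacturer", "Ferrari"),
    ("Turin", "country", "Italy"),
]
sim_companies = dot(dir_property_count("Fiat", triples),
                    dir_property_count("Ferrari", triples))
sim_mixed = dot(dir_property_count("Fiat", triples),
                dir_property_count("Turin", triples))
```

Keeping the direction in the key means an incoming `locationCity` edge (a city) does not match an outgoing one (a company).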

Comparison

Other Measures of Closeness
- Local Measures: local graph properties
  - Baselines:
    - Random Objects
    - Objects with Common Properties
    - Property Co-occurrence
- Global Measures: global graph properties
- Exogenous Measures: external (non-graph) information

Global Graph Measures
We use two global measures of closeness, based on graph geodesics and graph diffusion. (We treat the graph as a simple undirected graph. We also remove all the literals and constants from the set of nodes to remove unintuitive paths.)
- Shortest path length
  - The length of a shortest path between two objects
  - We calculate the distances corresponding to the k nearest objects
- Exponential diffusion kernel
  - Based on computing the matrix exponential of the graph adjacency matrix A
  - Parameter α controls how local/global the similarities are
  - Takes into account both the total number of paths between nodes and their respective lengths
  - Robust measure
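Both global measures can be illustrated on a tiny undirected graph. The sketch assumes the kernel is exp(αA) and approximates it with a truncated Taylor series in pure Python; the `terms` cutoff, the function names, and the three-node path graph are assumptions, and this approach would not scale to graphs of the size discussed in the slides.

```python
from collections import deque


def shortest_path_lengths(adjacency, source):
    """Breadth-first search: hop distance from source to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def diffusion_kernel(A, alpha, terms=30):
    """Approximate exp(alpha * A) by summing the first `terms` Taylor terms."""
    n = len(A)
    kernel = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in kernel]  # holds (alpha * A)^k / k!
    for k in range(1, terms):
        term = [[alpha / k * sum(term[i][m] * A[m][j] for m in range(n))
                 for j in range(n)] for i in range(n)]
        kernel = [[kernel[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    return kernel


# Path graph a - b - c, treated as undirected.
adjacency = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
dist = shortest_path_lengths(adjacency, "a")

A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
K = diffusion_kernel(A, alpha=0.5)
```

In this example the kernel value between the adjacent nodes a and b is larger than between a and c, which are two hops apart.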

Exogenous Measures
- Independent of the graph structure
- Rely on additional external information about the objects
- Helpful for nodes with few connections in the graph
Textual information:
- For some of the objects, we have extended abstracts describing the objects
- TF-IDF weighting + cosine similarity
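A minimal pure-Python sketch of the textual measure. The slides do not say which TF-IDF variant was used, so raw term counts multiplied by log(N / df) are assumed here; the whitespace tokenizer, the function names, and the three short "abstracts" are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: {name: text}. Returns {name: {term: tf * log(N / df)}}."""
    tokenised = {name: text.lower().split() for name, text in docs.items()}
    n_docs = len(tokenised)
    df = Counter()
    for tokens in tokenised.values():
        df.update(set(tokens))  # document frequency: one count per document
    vectors = {}
    for name, tokens in tokenised.items():
        tf = Counter(tokens)
        vectors[name] = {t: c * math.log(n_docs / df[t]) for t, c in tf.items()}
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


# Hypothetical abstracts.
docs = {
    "Fiat": "italian car manufacturer turin",
    "Ferrari": "italian sports car manufacturer maranello",
    "Ljubljana": "capital city slovenia",
}
vectors = tfidf_vectors(docs)
sim_cars = cosine(vectors["Fiat"], vectors["Ferrari"])
sim_other = cosine(vectors["Fiat"], vectors["Ljubljana"])
```

Because this similarity uses only text, it can still find neighbours for an object that has almost no edges in the graph.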

Results - IRank

Results - Top10

In vs. Out properties

Deleting several properties
Method: DirPropertyCount vector
Dataset: DBraw
We remove a fixed fraction of in and out properties.

Degradation – nodes / edges The negative effect of deleting a fraction of edges or nodes from the network

Degradation – properties The effect of deleting the K most frequent properties from the network

Conclusion
- Method for predicting missing properties
- Use a kernel smoother
- Measure similarity in a number of different ways:
  - Local properties
  - Global graph structure
  - External data (text)
- Extensive experimentation
- Investigate further how to combine measures
- More details about the research are in the papers:
  - Linked data: Predicting missing properties [machine learning]
  - Predicting Instance Properties in Linked Data [semantics of data]

Take home message
- Big redundancy / regularity in the data
- Local measures perform well
- Scale changes the structure -> we need a different method

What’s Your Message? Questions?