Deriving “sub-source” similarities from heterogeneous, semi-structured information sources
D. Rosaci, G. Terracina, D. Ursino
Dipartimento di Informatica, Matematica, Elettronica e Trasporti, Università “Mediterranea” di Reggio Calabria
International Conference on Cooperative Information Systems (CoopIS 2001), September 5-7, Trento

Motivations
Scheme Match: finding a mapping between those elements of two schemes that semantically correspond to each other.
Applications: information source integration, e-commerce, scheme evaluation and migration, data and web warehousing, information source design, and so on.
The need for semi-automatic techniques to carry out this task is now widely recognized.
Most of the Scheme Match techniques proposed in the literature have been designed only for databases. They aim at deriving terminological and structural relationships between single concepts.

Motivations
New approaches to Scheme Match that handle semi-structured information sources are needed. Such approaches must differ from the traditional ones because:
- in semi-structured information sources, significant pieces of information are expressed as groups of concepts rather than single concepts;
- different instances of the same concept can have different structures.
The emphasis therefore shifts from extracting semantic correspondences between concepts to deriving semantic correspondences between groups of concepts.

General characteristics of the approach
We propose a semi-automatic technique for extracting similarities between sub-sources belonging to different, heterogeneous, semi-structured information sources.
A conceptual model capable of uniformly handling information sources of different formats is extremely useful for this purpose. Translation rules should be defined from classical information source formats to the adopted conceptual model.
Our approach exploits the SDR-Network conceptual model, which meets the requirements described above.

General characteristics of the approach
Given an information source IS, the number of possible sub-sources that can be derived from it is extremely high. To avoid handling huge numbers of sub-source pairs, we propose a heuristic technique for singling out only the most promising ones.
Once the most promising pairs of sub-sources have been selected, their similarity degree must be computed. SS_i can be detected to be similar to SS_j only if it is possible to single out concepts of SS_i and SS_j that are, in turn, pairwise similar.
The similarity degree associated with each pair of sub-sources is determined by computing the objective function associated with a maximum weight matching.

General characteristics of the approach
The SDR-Network and its metrics have already been exploited to define a technique for deriving synonymies and homonymies. Overall, we propose a unified, semi-automatic approach for deriving concept synonymies and homonymies, as well as sub-source similarities.
This is particularly interesting because:
- we derive a property that is generally not handled by the Scheme Match approaches proposed in the literature;
- the technique proposed here is part of a more general framework for deriving various kinds of terminological and structural properties.

The SDR-Network conceptual model
Given an information source IS, the associated SDR-Network Net(IS) is $Net(IS) = \langle NS(IS), AS(IS) \rangle$, where:
- NS(IS) is the set of nodes; each node is characterized by a name;
- AS(IS) is the set of arcs; each arc can be represented by a triplet $\langle S, T, L_{ST} \rangle$, where S is the source node, T is the target node, and $L_{ST} = [d_{ST}, r_{ST}]$ is a label associated with the arc.

The SDR-Network conceptual model
d_ST is the semantic distance coefficient:
- it indicates how semantically close the concept expressed by T is to the concept expressed by S;
- it depends on the capability of the concept associated with T to characterize the concept associated with S.
r_ST is the semantic relevance coefficient: it indicates the fraction of instances of the concept denoted by S whose complete definition requires at least one instance of the concept represented by T.

The SDR-Network conceptual model
The Path Semantic Distance PSD_P of a path P in Net(IS) is the sum of the semantic distance coefficients associated with the arcs included in the path.
The Path Semantic Relevance PSR_P of a path P in Net(IS) is the product of the semantic relevance coefficients associated with the arcs included in the path.
The CD-Shortest-Path (Conditional D-Shortest-Path) between two nodes N and N' in Net(IS) and including an arc A, denoted by $\langle N, N' \rangle_A$, is the path having the minimum Path Semantic Distance among those connecting N and N' and including A.
A D_Path_n is a path P in Net(IS) such that $n \leq PSD_P < n+1$.
The i-th neighborhood of an SDR-Network node x is:
$nbh(x, i) = \{A \mid A \in AS(IS),\ A = \langle z, y, L_{zy} \rangle,\ \langle x, y \rangle_A \text{ is a } D\_Path_i,\ x \neq y\},\quad i \geq 0$
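The metrics above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: arcs are stored as hypothetical `(source, target, d, r)` tuples, the node names and coefficients are invented, the d coefficients are assumed non-negative, and the CD-Shortest-Path from x to y through an arc A = ⟨z, y⟩ is approximated as the shortest path from x to z followed by A.

```python
import heapq
from math import floor, prod

# Hypothetical arcs: (source, target, d, r), where d is the semantic
# distance coefficient and r the semantic relevance coefficient.
ARCS = [
    ("Project", "Type", 0.0, 0.9),
    ("Project", "Partner", 1.0, 0.5),
    ("Partner", "Country", 0.5, 1.0),
]

def psd(path):
    """Path Semantic Distance: sum of the d coefficients along the path."""
    return sum(d for _, _, d, _ in path)

def psr(path):
    """Path Semantic Relevance: product of the r coefficients along the path."""
    return prod(r for _, _, _, r in path)

def shortest_distances(arcs, start):
    """Dijkstra over the d coefficients (assumed non-negative)."""
    dist = {start: 0.0}
    heap = [(0.0, start)]
    while heap:
        cost, node = heapq.heappop(heap)
        if cost > dist.get(node, float("inf")):
            continue  # stale heap entry
        for s, t, d, _ in arcs:
            if s == node and cost + d < dist.get(t, float("inf")):
                dist[t] = cost + d
                heapq.heappush(heap, (cost + d, t))
    return dist

def nbh(arcs, x, i):
    """Arcs <z, y> (with y != x) whose cheapest path from x that ends with
    the arc itself has a Path Semantic Distance in [i, i+1)."""
    dist = shortest_distances(arcs, x)
    return [
        (z, y, d, r)
        for z, y, d, r in arcs
        if y != x and z in dist and floor(dist[z] + d) == i
    ]

print(psd(ARCS[1:]), psr(ARCS[1:]))
print(nbh(ARCS, "Project", 0))
print(nbh(ARCS, "Project", 1))
```

With this toy network, the arc from Project to Type falls in neighborhood 0, while the two remaining arcs fall in neighborhood 1.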

Selection of promising pairs of sub-sources
The number of possible sub-sources that can be identified in IS is exponential in the number of nodes of Net(IS). We have defined a technique for singling out the most promising pairs of sub-sources.
The technique receives two information sources IS_1 and IS_2 and a Dictionary SD of Synonymies between nodes of Net(IS_1) and Net(IS_2).
Synonymies are represented in SD by tuples of the form $\langle N_i, N_j, f_{ij} \rangle$, where N_i and N_j are the synonym nodes and f_ij is a coefficient in the real interval [0,1] indicating the similarity degree of N_i and N_j.

The technique works according to the following rules:
- It considers those pairs of sub-sources [SS_i, SS_j] such that SS_i ⊆ Net(IS_1) is a rooted sub-net having a node N_i as root, SS_j ⊆ Net(IS_2) is a rooted sub-net having a node N_j as root, and N_i and N_j are interesting synonyms, i.e., the synonymy coefficient associated with them is greater than a certain threshold.
- It computes the maximum weight matching on suitable bipartite graphs obtained from the target nodes of the arcs included in the neighborhoods of N_i and N_j.
- Given a pair of synonym nodes N_i and N_j, it derives a promising pair of sub-sources [SS_i^k, SS_j^k] for each k such that both nbh(N_i, k) and nbh(N_j, k) are non-empty.
- SS_i^k and SS_j^k are constructed by determining the promising pairs of arcs [A_i^k, A_j^k] such that A_i^k ∈ nbh(N_i, l) and A_j^k ∈ nbh(N_j, l), for each l in the integer interval [0, k].

Selection of promising pairs of sub-sources
- A pair of arcs [A_i^k, A_j^k] is considered promising if:
  - an edge between the target node T_i^k of A_i^k and the target node T_j^k of A_j^k is present in the maximum weight matching computed on a suitable bipartite graph constructed from the target nodes of the arcs of nbh(N_i, l) and nbh(N_j, l), for some l in the integer interval [0, k];
  - the similarity degree of T_i^k and T_j^k is greater than a given threshold.
The rationale underlying this approach is to construct promising pairs of sub-sources such that each pair consists of the maximum possible number of pairs of concepts whose synonymy has already been established.
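The two conditions for a promising pair of arcs can be sketched as follows. This is a minimal illustration, not the authors' algorithm: the matching is computed by exhaustive search (adequate only for tiny graphs), a single neighborhood index is handled rather than the whole interval [0, k], and all node names, coefficients and the threshold are hypothetical.

```python
def max_weight_matching(left, right, weight):
    """Exhaustive maximum weight bipartite matching.
    `weight` maps (l, r) pairs to a similarity; missing pairs cannot match."""
    def best(i, used):
        if i == len(left):
            return 0.0, []
        # Option 1: leave left[i] unmatched.
        top_score, top_edges = best(i + 1, used)
        # Option 2: match left[i] with any free right node it is similar to.
        for r in right:
            w = weight.get((left[i], r))
            if w is None or r in used:
                continue
            score, edges = best(i + 1, used | {r})
            if w + score > top_score:
                top_score, top_edges = w + score, [(left[i], r)] + edges
        return top_score, top_edges
    return best(0, frozenset())[1]

def promising_arc_pairs(arcs_i, arcs_j, synonymy, threshold):
    """arcs_* are (source, target) pairs from two neighborhoods of the same
    index; a pair is kept if its targets are matched and similar enough."""
    targets_i = sorted({t for _, t in arcs_i})
    targets_j = sorted({t for _, t in arcs_j})
    matched = set(max_weight_matching(targets_i, targets_j, synonymy))
    return [
        (a, b)
        for a in arcs_i
        for b in arcs_j
        if (a[1], b[1]) in matched and synonymy[(a[1], b[1])] > threshold
    ]

# Hypothetical synonymy coefficients between target nodes.
SYNONYMY = {
    ("Type[ESF]", "Type[ECP]"): 0.9,
    ("Budget[ESF]", "Cost[ECP]"): 0.7,
    ("Type[ESF]", "Cost[ECP]"): 0.3,
}
ARCS_ESF = [("Project[ESF]", "Type[ESF]"), ("Project[ESF]", "Budget[ESF]")]
ARCS_ECP = [("Project[ECP]", "Type[ECP]"), ("Project[ECP]", "Cost[ECP]")]

print(promising_arc_pairs(ARCS_ESF, ARCS_ECP, SYNONYMY, threshold=0.8))
```

In this toy input both target pairs take part in the matching, but only the Type/Type pair exceeds the threshold of 0.8, so a single pair of arcs is reported. For realistic sizes a polynomial assignment algorithm would replace the exhaustive search.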

Selection of promising pairs of sub-sources
Theorem. Let IS_1 and IS_2 be two information sources and let Net(IS_1) and Net(IS_2) be the corresponding SDR-Networks. Let n_c1 (resp., n_c2) be the number of complex nodes of Net(IS_1) (resp., Net(IS_2)). Let l be the maximum neighborhood index associated with a node of Net(IS_1) or Net(IS_2). Then the number of possible pairs of sub-sources is min(n_c1, n_c2) × (l+1).
In real applications, the number of promising pairs of sub-sources relative to IS_1 and IS_2 is generally far lower than min(n_c1, n_c2) × (l+1).
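As a quick worked instance of the bound, assume purely hypothetical values: 4 complex nodes in Net(IS_1), 6 in Net(IS_2), and a maximum neighborhood index of 2. These numbers are not taken from the talk.

```latex
\min(n_{c1}, n_{c2}) \times (l + 1) = \min(4, 6) \times (2 + 1) = 4 \times 3 = 12
```

So at most 12 pairs of sub-sources would need to be examined in this case, instead of a number exponential in the size of the two networks.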

Example
[Figure: the SDR-Network of the European Social Funds (ESF) information source]

Example
[Figure: the SDR-Network of the European Community Projects (ECP) information source]

Example
[Figure: the Synonymy Dictionary associated with ESF and ECP]

Example
The interesting pairs of synonym nodes are {,, }. As an example, consider the pair of synonym nodes Project[ESF] and Project[ECP].
Since the neighborhoods of Project[ESF] and Project[ECP] are both non-empty only for k=0, k=1 and k=2, our technique obtains three promising pairs of sub-sources relative to Project[ESF] and Project[ECP].
To illustrate the behaviour of our technique, we show the derivation of the promising pair of sub-sources associated with Project[ESF] and Project[ECP] for k=0.

Example
[Figure: the bipartite graph and the associated maximum weight matching]

Example
The technique selects only those arcs of nbh(Project[ESF], 0) and nbh(Project[ECP], 0) that participate in the matching and have a similarity coefficient greater than a given threshold.
The promising pair of sub-sources associated with nbh(Project[ESF], 0) and nbh(Project[ECP], 0) is [SS_1, SS_2]:
SS_1 = {, Project[ESF], Type[ESF], [0, 0.9]>,, }
SS_2 = {, Project[ECP], Type[ECP], [0, 0.6]>,, }
The technique works analogously for k=1 and k=2, as well as for the other interesting synonym pairs.

Derivation of sub-source similarities
The technique for deriving sub-source similarities from a given pair of information sources consists of two steps:
- the first step computes the similarity degree of each promising pair of sub-sources derived previously;
- the second step constructs a Sub-source Similarity Dictionary SSD by selecting only those pairs of sub-sources whose similarity degree is greater than a dynamically computed threshold.
More formally, SSD is obtained by composing two functions: the first-step function is applied to (SPS, SD), and the second-step function is applied to its result, where:
- SPS is the set of promising pairs of sub-sources;
- SD is the Synonymy Dictionary.

Derivation of sub-source similarities
For each promising pair of sub-sources [SS_i, SS_j], the first-step function calls a similarity function that computes the corresponding similarity degree f_ij. Its result is the set SSS, which contains one triplet $\langle SS_i, SS_j, f_{ij} \rangle$ for each [SS_i, SS_j] ∈ SPS.
An auxiliary function receives a rooted sub-net SS and returns the nodes of SS.
The similarity function receives the Synonymy Dictionary and the node sets P and Q of SS_i and SS_j. It derives their similarity degree by computing a suitable objective function associated with the maximum weight matching E' on a bipartite graph constructed from the nodes of SS_i and SS_j:
$\left(1 - \frac{|P| + |Q| - 2|E'|}{|P| + |Q|}\right) \times \frac{\mathrm{weight}(E')}{|E'|}$
where weight(E') is the total weight of the edges in E'.
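The objective function can be sketched as a small Python function. This is an illustrative reading of the formula, assuming that the second factor is the mean weight of the matched edges; the function name, the node labels and the weights are hypothetical, and returning 0.0 for an empty matching is an added convention, since the formula is undefined in that case.

```python
def similarity_degree(p_nodes, q_nodes, matching):
    """`matching` maps matched (p, q) node pairs to their similarity weight.
    Returns (1 - unmatched fraction) * mean weight of the matched edges."""
    if not matching:
        return 0.0  # assumed convention: nothing matched means no similarity
    size = len(p_nodes) + len(q_nodes)
    unmatched_fraction = (size - 2 * len(matching)) / size
    mean_weight = sum(matching.values()) / len(matching)
    return (1 - unmatched_fraction) * mean_weight

# Hypothetical sub-sources with five nodes each, all of them matched.
P = ["p1", "p2", "p3", "p4", "p5"]
Q = ["q1", "q2", "q3", "q4", "q5"]
MATCHING = {
    ("p1", "q1"): 1.0,
    ("p2", "q2"): 0.9,
    ("p3", "q3"): 0.95,
    ("p4", "q4"): 0.9,
    ("p5", "q5"): 0.9,
}

print(round(similarity_degree(P, Q, MATCHING), 2))
```

The first factor penalizes nodes left out of the matching, while the second rewards strong synonymies among the matched ones. Dropping one matched pair from this toy input lowers the first factor from 1 to 0.8.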

Derivation of sub-source similarities
The second-step function constructs the Sub-source Similarity Dictionary SSD by taking those similarities of SSS whose coefficient is greater than a dynamically computed threshold:
$SSD = \{\langle SS_i, SS_j, f_{ij} \rangle \mid \langle SS_i, SS_j, f_{ij} \rangle \in SSS,\ f_{ij} > th_{Sim}\}$
Here th_Sim is the dynamically computed threshold:
$th_{Sim} = \min\left(\frac{F_{Max} + F_{Min}}{2},\ th_M\right)$
where:
- F_Max is the maximum coefficient associated with the similarities of SSS;
- F_Min is the minimum coefficient associated with the similarities of SSS;
- th_M is a limit threshold value.
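The thresholding step translates almost directly into code. The sketch below assumes SSS is given as a list of hypothetical `(ss_i, ss_j, f_ij)` triplets; the function name, the sub-source labels, the coefficients and the value chosen for the limit threshold are all illustrative.

```python
def build_ssd(sss, th_m):
    """Keep the triplets of SSS whose coefficient exceeds
    th_sim = min((F_max + F_min) / 2, th_m)."""
    if not sss:
        return []
    coefficients = [f for _, _, f in sss]
    th_sim = min((max(coefficients) + min(coefficients)) / 2, th_m)
    return [triplet for triplet in sss if triplet[2] > th_sim]

# Hypothetical similarity triplets.
SSS = [
    ("SS1", "SS2", 0.93),
    ("SS3", "SS4", 0.80),
    ("SS5", "SS6", 0.40),
]

print(build_ssd(SSS, th_m=0.5))
print(build_ssd(SSS, th_m=0.3))
```

With a limit threshold of 0.5 the midpoint 0.665 is capped at 0.5, so the pair scored 0.40 is dropped; with a limit of 0.3 every triplet passes and SSD equals SSS.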

Example
Consider the SDR-Networks ESF and ECP, and the computation of SSD from SPS and SD. For the pair of sub-sources [SS_1, SS_2] ∈ SPS derived previously, the similarity function is called with SD, the nodes of SS_1 and the nodes of SS_2 as arguments.
[Figure: the bipartite graph and the associated maximum weight matching relative to the nodes of SS_1 and SS_2]

Example
The objective function associated with the maximum weight matching is (1 - ((5+5-2*5)/10)) * ( )/5 = 0.93.
The similarity degrees associated with all the other promising pairs of sub-sources are obtained in the same way.
SSS is then provided as input to the second-step function, which constructs the Sub-source Similarity Dictionary SSD by selecting those triplets of SSS whose similarity coefficient is greater than th_Sim.
In this example all similarities of SSS are valid and SSD = SSS.

Applications
Sub-source similarities can be exploited in several contexts. All applications of Scheme Match relative to synonymies between single concepts can be extended to similarities between sub-sources.
In particular, sub-source similarities can be exploited for:
- Information Source Integration
- E-commerce
- Semantic Query Processing
- Data and Web Warehouse
- Source clustering and cataloguing

Conclusions
We have presented a semi-automatic technique for deriving similarities between sub-sources belonging to information sources having different formats.
The technique is based on a conceptual model, called SDR-Network, which allows information sources of different formats to be represented uniformly.
It consists of two steps: the first selects a set of promising pairs of sub-sources, whereas the second computes a similarity degree to associate with each pair of the set.
We have pointed out that the derivation of sub-source similarities is a special case of the more general problem of Scheme Match.
Finally, we have illustrated a set of applications which could benefit from sub-source similarities.

Present and Future Work
We have already designed an approach which exploits sub-source similarities for carrying out information source integration.
In the future we plan to:
- develop techniques which exploit sub-source similarities in the other application contexts mentioned previously;
- define techniques for deriving other terminological and structural properties in the context of semi-structured information sources.

For more information... Domenico Ursino Dipartimento di Informatica, Matematica, Elettronica, Trasporti Università Mediterranea di Reggio Calabria Web: