Who Are Similar to Einstein: A Multi-Type Object Similarity Measure for Entity Recommendation Zheng Liang
Outline Introduction Similarity measures based on EMD Approaches to Entity Type Weighting Evaluation Summary
Introduction Today, most user activity in Web search and browsing is centered on entities. To help users explore further, a growing number of online systems (such as Google, Yahoo!, and others) can identify the real-world entity behind a query and recommend related entities based on the relationships in a knowledge base.
Introduction With the publication of a number of knowledge bases as Linked Data (such as Freebase, DBpedia, and others), we have extremely valuable resources to draw on. However, such knowledge bases contain a large number of entities related to the current entity through its relationships. It is therefore difficult for an online system to determine what users are actually looking for.
Introduction However, we know not only that a user's initial understanding of an entity can be uniquely linked to an entity type in a knowledge base, but also that the entity type is an important and interesting facet of each entity. Here we focus on recommending the entities most relevant to the current entity's types. Large-scale knowledge bases define a multitude of entity types. For example, the entity `Albert Einstein' in DBpedia has 63 types, among which `Person', `JewishScientists', `NobelLaureatesInPhysics', and `ETHZurichAlumni' can be found.
Introduction Thus, there is a need to evaluate semantic similarity between multi-type entities. In previous research, the objects being compared are often modeled as sets, and their similarity is traditionally determined from the set intersection. Most existing similarity measures, such as the Cosine, Dice, Jaccard, and Overlap measures and the information-theoretic measure, follow this approach.
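The set-based measures named above can be sketched in a few lines over type sets (function names and the toy type sets are illustrative, not from the paper):

```python
import math

def jaccard(a, b):
    """Jaccard: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b) if a | b else 0.0

def dice(a, b):
    """Dice: 2|A ∩ B| / (|A| + |B|)."""
    return 2 * len(a & b) / (len(a) + len(b)) if a or b else 0.0

def overlap(a, b):
    """Overlap: |A ∩ B| / min(|A|, |B|)."""
    return len(a & b) / min(len(a), len(b)) if a and b else 0.0

def cosine(a, b):
    """Cosine over binary set vectors: |A ∩ B| / sqrt(|A| * |B|)."""
    return len(a & b) / math.sqrt(len(a) * len(b)) if a and b else 0.0

# Toy type sets for two entities (illustrative only).
einstein = {"Person", "JewishScientists", "NobelLaureatesInPhysics"}
curie = {"Person", "Scientist", "NobelLaureatesInChemistry"}
print(jaccard(einstein, curie), dice(einstein, curie))
```

All four treat types as unordered, equally weighted symbols, which is exactly the limitation the next slides address.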
Introduction However, the above similarity measures cannot take structural similarity between objects into account, even when a hierarchy describing the relationships among domain elements is available. By exploiting hierarchical structure in a domain, such as WordNet or Cyc, a variety of methods for measuring semantic similarity/distance between objects have been proposed. The main approaches, such as Shortest Path Lengths and the Lowest Common Ancestor (LCA), are based on distance within an ontological structure or on concept information content.
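As one concrete instance of the LCA-based family, Wu-Palmer similarity scores two types by the depth of their lowest common ancestor. A minimal sketch, assuming a tree-shaped parent map with a shared root (the helper names and the toy hierarchy are illustrative, not the real DBpedia ontology):

```python
def ancestor_chain(parent, t):
    """Ancestors of type t, from t itself up to the root (inclusive)."""
    chain = [t]
    while t in parent:
        t = parent[t]
        chain.append(t)
    return chain

def wu_palmer(parent, t1, t2):
    """Wu-Palmer: 2 * depth(LCA) / (depth(t1) + depth(t2)), root depth = 1.
    Assumes both types share a root, so an LCA always exists."""
    a1, a2 = ancestor_chain(parent, t1), ancestor_chain(parent, t2)
    s2 = set(a2)
    lca = next(a for a in a1 if a in s2)   # first shared ancestor = lowest
    depth = lambda t: len(ancestor_chain(parent, t))
    return 2 * depth(lca) / (depth(t1) + depth(t2))

# Toy hierarchy (illustrative).
parent = {"Scientist": "Person", "Physicist": "Scientist", "Chemist": "Scientist"}
print(wu_palmer(parent, "Physicist", "Chemist"))
```

Identical types score 1.0; the score decays as the LCA moves toward the root.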
#Entity          #Types
Albert Einstein  Person, JewishScientists, NobelLaureatesInPhysics, ETHZurichAlumni
Max Born
Felix Bloch
Marie Curie      Scientist, NobelLaureatesInChemistry

Sim         s(Einstein,Born)   s(Einstein,Bloch)   s(Einstein,Curie)
Jaccard     0.5
Cosine-IDF  0.55               0.87
Cosine-LCA  0.79               0.66                0.68

The results demonstrate how the three measures differ. But which one is more reasonable?
Introduction Obviously, measuring pairwise element similarity is important for computing the similarity between two collections. But the importance of each element within its collection plays an even more crucial role: it represents the element's contribution, or weight, in computing the similarity between the two collections, and it determines how “good” a “match” between elements of the two collections is.
Introduction In this study we introduce a novel similarity measure based on the earth mover’s distance (EMD) [20], which takes into account not only pairwise element similarity but also the weight of each element. Here, the weight of an entity type is the key factor in EMD. We define the new task of entity type weighting, whose goal is to measure the importance of each entity type, and we propose several methods for it that exploit the entity type hierarchy (e.g., the depth of a type's ancestors), collection statistics (e.g., IDF), and the graph structure (e.g., weighted PageRank).
Similarity measures based on EMD The problem is formalized as follows. Entity X has types t_x1, …, t_xi, …, t_xm with capacities (weights) w_x1, …, w_xi, …, w_xm, and entity Y has types t_y1, …, t_yj, …, t_yn with capacities w_y1, …, w_yj, …, w_yn, where
  Σ_i w_xi = 1 and Σ_j w_yj = 1.
The pairwise type similarity satisfies 0 ≤ s(t_xi, t_yj) ≤ 1, and the cost of moving weight from t_xi to t_yj is
  b_ij = 1 − s(t_xi, t_yj), with 0 ≤ b_ij ≤ 1, for 1 ≤ i ≤ m, 1 ≤ j ≤ n.
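The EMD between the two weighted type sets is the minimum-cost flow that moves the capacities of X onto those of Y, a standard transportation linear program. A minimal sketch using SciPy's LP solver (the function name and the similarity-vs.-distance conversion are illustrative assumptions):

```python
import numpy as np
from scipy.optimize import linprog

def emd_similarity(w_x, w_y, sim):
    """EMD-based similarity between two weighted type sets.
    w_x, w_y: weight vectors, each summing to 1; sim[i][j] = s(t_xi, t_yj)."""
    sim = np.asarray(sim, dtype=float)
    m, n = sim.shape
    cost = 1.0 - sim                      # b_ij = 1 - s(t_xi, t_yj)
    # Flatten the flow matrix f (row-major) into an m*n vector for linprog.
    A_eq, b_eq = [], []
    for i in range(m):                    # row sums: sum_j f_ij = w_xi
        row = np.zeros(m * n); row[i * n:(i + 1) * n] = 1.0
        A_eq.append(row); b_eq.append(w_x[i])
    for j in range(n):                    # column sums: sum_i f_ij = w_yj
        col = np.zeros(m * n); col[j::n] = 1.0
        A_eq.append(col); b_eq.append(w_y[j])
    res = linprog(cost.ravel(), A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=(0, None), method="highs")
    return 1.0 - res.fun                  # similarity = 1 - minimal transport cost

print(emd_similarity([0.5, 0.5], [1.0], [[1.0], [0.0]]))
```

Because both weight vectors sum to 1, the minimal cost lies in [0, 1], so 1 minus the cost is a similarity in [0, 1].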
Approaches to Entity Type Weighting We define the task of entity type weighting: given an entity e and its types Te = {t1, t2, …, tn} in the knowledge base, we define a type weighting function w(ti), ti ∈ Te, with w(t1), w(t2), …, w(tn) ∈ [0, 1] such that Σ_i w(ti) = 1. Here w(ti) > w(tj) means that type ti is more important than type tj among the entity types Te.
Approaches to Entity Type Weighting Statistics-based Approach (IDF): w_xi = idf(t_xi) / Σ_{t_xi ∈ X} idf(t_xi), with 0 ≤ w_xi ≤ 1. Hierarchy-based Approach (ANC_DEPTH): w_xi = ANC_DEPTH(t_xi) / Σ_{t_xi ∈ X} ANC_DEPTH(t_xi).
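Both normalizations can be sketched as follows (the helper names and the corpus layout are illustrative assumptions, not the paper's code):

```python
import math
from collections import Counter

def idf_weights(entity_types, corpus_type_sets):
    """Statistics-based weights: normalized IDF over the entity's types.
    corpus_type_sets holds one set of types per entity in the collection."""
    n = len(corpus_type_sets)
    df = Counter(t for ts in corpus_type_sets for t in ts)  # document frequency
    idf = {t: math.log(n / df[t]) for t in entity_types}
    total = sum(idf.values())   # assumes at least one type is non-ubiquitous
    return {t: v / total for t, v in idf.items()}

def anc_depth_weights(entity_types, anc_depth):
    """Hierarchy-based weights: normalized ancestor-depth score per type."""
    total = sum(anc_depth[t] for t in entity_types)
    return {t: anc_depth[t] / total for t in entity_types}
```

Either function returns weights in [0, 1] that sum to 1, as the task definition requires.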
Approaches to Entity Type Weighting Weighted PageRank-based Approach There are common-sense modes of thinking, such as vertical thinking and horizontal thinking. In the current context, vertical and horizontal thinking are reflected in how users perceive entity types. We restructure the entity type graph and define two new kinds of edge: “Vertical Edge” and “Horizontal Edge”.
Approaches to Entity Type Weighting Weighted PageRank-based Approach [Figure: the restructured entity type graph, with vertical edges from t1 down to t2, …, tn and horizontal edges between sibling types.]
Approaches to Entity Type Weighting Weighted PageRank-based Approach Furthermore, when a user navigates inside the entity type DAG, the user may prefer one kind of edge over the other. We therefore define a Weighted Type Graph: w(i, j) = p * vert(i, j) + (1 − p) * hor(i, j), where vert(i, j) and hor(i, j) are 0 or 1, indicating the existence of a vertical or horizontal edge from i to j respectively, and p is the navigational preference of the surfer.
Approaches to Entity Type Weighting Weighted PageRank-based Approach We denote the measurement of an entity type based on Weighted PageRank as Cp. The weight of each entity type is then computed as w_xi = Cp(t_xi) / Σ_{t_xi ∈ X} Cp(t_xi), with 0 ≤ w_xi ≤ 1.
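A power-iteration sketch over the weighted type graph defined above (the damping factor, iteration count, and edge-dictionary layout are assumptions for illustration):

```python
def weighted_pagerank(nodes, weight, d=0.85, iters=50):
    """PageRank by power iteration on a weighted type graph.
    weight[(i, j)] holds w(i, j) = p*vert(i, j) + (1-p)*hor(i, j)."""
    n = len(nodes)
    out = {i: sum(w for (a, _), w in weight.items() if a == i) for i in nodes}
    rank = {i: 1.0 / n for i in nodes}
    for _ in range(iters):
        new = {}
        for j in nodes:
            # Each in-neighbor i passes rank proportional to w(i, j)/out(i).
            inflow = sum(rank[i] * w / out[i]
                         for (i, jj), w in weight.items() if jj == j and out[i] > 0)
            new[j] = (1 - d) / n + d * inflow
        rank = new
    total = sum(rank.values())             # renormalize (sink nodes leak mass)
    return {i: r / total for i, r in rank.items()}

# Tiny graph: t1 points to t2 and t3, t2 points to t3.
scores = weighted_pagerank(["t1", "t2", "t3"],
                           {("t1", "t2"): 1.0, ("t1", "t3"): 1.0, ("t2", "t3"): 1.0})
print(scores)
```

The final renormalization makes the Cp scores usable directly as type weights that sum to 1.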
EVALUATION The Experimental Setup DBpedia, 4 data sets (Scientist, Actor, Company, City):

Data set   #Entity  #Types  Max.Type  Avg.Type   Avg.Depth
Scientist  9920     7980    55        14.328629  5.68
Actor      2244     1513    26        16.070856  5.22
Company    31096    9137    52        11.959127  6.71
City       13494    2596    17        10.809471  7.63
The Experimental Setup Case Study Two tasks: entity type weighting; similar-type entity recommendation. Four entities: Einstein, Sydney, Jackie Chan, IBM. Gold standard: the depth-10 pooling technique; 20 users give ratings of 3, 2, and 1 (“highly important/similar”, “somewhat important/similar”, and “not important/similar”).
Evaluation Metrics
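The experiments report NDCG@k over the users' graded relevance labels. For reference, a minimal sketch of the metric (function names are illustrative):

```python
import math

def dcg_at_k(rels, k):
    """Discounted cumulative gain over the top-k graded relevance labels."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))

def ndcg_at_k(rels, k):
    """NDCG@k: DCG of the given ranking divided by DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0

# Labels of the ranked results, in rank order (3/2/1 rating scale from the study).
print(ndcg_at_k([3, 2, 3, 1, 1], 3))
```

A perfect ranking scores 1.0; putting low-rated entities near the top pushes the score toward 0.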
1: Type Weight, NDCG@3

Method       Albert Einstein  Sydney  Jackie Chan  IBM
IDF          0.4010           0.4574  0.4914       0.4348
ANC_DEPTH    0.7109           0.4702  0.5184       0.4340
WPR (p=0.2)  0.7478           0.4431
WPR (p=0.5)
WPR (p=0.8)  0.5800           0.5990  0.4611
Analysis of the Results Observation 1: the nDCG values of the WPR-based method are higher than those of the IDF and ANC_DEPTH methods, which validates the effectiveness of the WPR approach. Observation 2: for the WPR-based method with navigation probability p set to 0.2, 0.5, and 0.8, the nDCG value rises or stays stable as p increases. The inference is that users judge a type's importance by how specific it is and how rich its neighborhood in the type graph is, which matches users' intuition.
2: Entity Recommendation Based on Similar Type, NDCG@3

Method (Weight, Cost)      Albert Einstein  Sydney  Jackie Chan  IBM
Jaccard                    0.6174           0.8046  0.6967       0.3966
Cosine-IDF                 0.7462           0.7918  0.3440
EMD (1/n, Edit-distances)  0.8308           0.9025  0.6297
EMD (1/n, LCA [1])         1                0.5451
EMD (IDF, Edit-distances)  0.4637
EMD (IDF, LCA [1])         0.5978
EMD (WPR, Edit-distances)  0.7989           0.9595

[1] Jiang, J. J., and Conrath, D. W. Semantic similarity based on corpus statistics and lexical taxonomy. arXiv preprint cmp-lg/9709008, 1997.
NDCG@5

Method (Weight, Cost)      Albert Einstein  Sydney  Jackie Chan  IBM
Jaccard                    0.7925           0.7797  0.7791       0.4322
Cosine-IDF                 0.8647           0.7448  0.6900       0.4912
EMD (1/n, Edit-distances)  0.8586           0.9659  0.7366       0.5810
EMD (1/n, LCA [1])         0.8692           0.8358  0.7821
EMD (IDF, Edit-distances)  0.5718           0.7855  0.6454       0.5604
EMD (IDF, LCA [1])         0.6765           0.8279  0.5648
EMD (WPR, Edit-distances)  0.7526           0.7812  0.7827       0.7085
NDCG@10

Method (Weight, Cost)      Albert Einstein  Sydney  Jackie Chan  IBM
Jaccard                    0.7984           0.7840  0.8390       0.6761
Cosine-IDF                 0.9502           0.8410  0.6474       0.6347
EMD (1/n, Edit-distances)  0.9182           0.9142  0.8391       0.7357
EMD (1/n, LCA [1])         0.9081           0.8870  0.8531       0.8073
EMD (IDF, Edit-distances)  0.6477           0.9287  0.7295       0.7301
EMD (IDF, LCA [1])         0.7394           0.8895  0.6680       0.7257
EMD (WPR, Edit-distances)  0.8455           0.9690  0.8402       0.7048
EMD (WPR, LCA [1])         0.9502           0.9314  0.8855       0.8209
Analysis of the Results Observation 1: the nDCG values obtained with the EMD-based methods are generally higher than those of the traditional methods (except for the IDF-based EMD method), which validates the effectiveness of the EMD approach. Observation 2: among the EMD methods weighted by 1/n, IDF, and WPR, the WPR-based EMD method generally achieves higher nDCG values than the 1/n- and IDF-based variants, validating the effectiveness of the WPR-based EMD method; the results match people's intuition. The IDF-based EMD method sometimes yields nDCG values even lower than the traditional methods. Conclusion: the weights play an important role in the EMD method, and an unreasonable weight assignment is counterproductive, leading to results worse than those of the simple methods.
Summary In summary, the main contributions of this paper are: We introduce multi-type object similarity measures based on EMD for similar entity recommendation, leading to similar entities that are more intuitive than the ones generated by traditional similarity measures. We define the task of entity type weighting and develop a novel approach to it, which simulates a user's walk on the type graph.
Limitations and Future Work The evaluation metric is singular (only NDCG; other metrics such as AP should be added). In the type weight experiment, extend NDCG@k to k = 3, 5, 10, 20. In the similar entity recommendation experiment, add comparisons against more traditional measures. Analyze the time complexity of the algorithms.