CIKM’ 09 November 3rd, 2009, Hong Kong

Presentation transcript:

P-Rank: A Comprehensive Structural Similarity Measure over Information Networks
Peixiang Zhao, Jiawei Han, Yizhou Sun
University of Illinois at Urbana-Champaign
Presented by Prof. Hong Cheng, CUHK

Outline
- Introduction & Motivation
- P-Rank
  - Formula
  - Derivatives
  - Computation
- Experimental Studies
- Future direction & Conclusion

Introduction
Information Networks (INs)
- Physical, conceptual, and human/societal entities
- Interconnected relationships among different entities
INs are ubiquitous and form a critical component of modern information infrastructure
- The Web
- Highway or urban transportation networks
- Research collaboration and publication networks
- Biological networks
- Social networks
Notes: Information networks have a graph as their underlying data model, which consists of a vertex set V and an edge set E.

Problem
Similarity computation on entities of INs
- How similar is webpage A to webpage B on the Web?
- How similar is researcher A to researcher B in the DBLP co-authorship network?
First of all, how should "similarity" be defined within a massive IN?
- Textual proximity of entity labels/contents
- Structural proximity conveyed through links
A good structural similarity measure in INs: SimRank (KDD'02)
Notes:
1. Similarity means how similar two nodes in the graph are.
2. Previous studies show that structural similarity can generate more meaningful results than text-based similarity measures such as string edit distance or the cosine measure. The main reason is that structural similarity is more homogeneous and language independent. By using the link information in the graph, the structural similarity between two nodes can draw on the neighborhood structure to reinforce the similarity belief.
3. SimRank is a de facto standard for structural similarity on large graphs. Much work has been done to extend SimRank to different scenarios.

Why SimRank is not Enough?
Philosophy: two entities are similar if they are referenced by similar entities
Potential problems
- Semantically incomplete: only partial structural information, from the in-link direction, is considered during similarity computation
- Biased similarity results: may fail in different IN settings
- Inefficient in computation: worst-case O(n^4), can be improved to O(n^3), where n is the number of vertices in the information network
Notes: SimRank is defined recursively. Because SimRank only considers the nodes that "cite" the vertex pair (in-link information) and neglects the vertices "cited" by the vertex pair (out-link information), the measure is somewhat biased. When the vertex pair has no common in-link neighbors, its SimRank score is undefined. SimRank is also extremely inefficient on large real information networks.

Why SimRank is not Enough?
(a) A Heterogeneous IN and Structural Similarity Scores
(b) A Homogeneous IN and Structural Similarity Scores
Notes: These two diagrams show where SimRank falls short compared with P-Rank.
The first diagram is a heterogeneous IN. It contains nodes of different categories: conference (C), committee members (m1, m2, m3) and papers (p1, p2, p3, p4). The edges carry different meanings: a conference "invites" a PC member, a PC member "bids" for a paper, a paper is "reviewed" by a PC member, and a paper is "accepted" by the conference. For the vertex pairs (m1, m2), (m1, m3) and (m2, m3), SimRank gives the same similarity score, so it cannot distinguish these three pairs. By using structural information from both the in-link and out-link sides, P-Rank differentiates them. The same happens for (p1, p3) and (p3, p4).
The second diagram is a homogeneous IN. All nodes belong to the same category; for example, all nodes are papers and the edges are citations. Again, SimRank cannot differentiate (p2, p3) from (p3, p4). Even worse, for the node pairs (p4, p5) and (p2, p5), SimRank is unavailable, because these vertex pairs have no common in-link neighbors that can pass similarity scores to them. P-Rank can measure the similarity of all vertex pairs.
The main philosophy is that P-Rank jointly takes into account the similarity flow from both directions. If one direction is unavailable (where SimRank may fail), P-Rank can still work.

P(enetrating)-Rank
Philosophy: two entities are similar if
- they are referenced by similar entities
- they reference similar entities
Advantages
- Semantically complete: structural information from both the in-link and out-link directions is considered during similarity computation
- Robust in different IN settings
- A unified structural similarity framework: SimRank is just a special case
Notes: P-Rank extends the definition by adding the similarity computation for the out-link direction. As for robustness, SimRank may fail in some INs, as in the previous example, whereas P-Rank can be employed in different IN settings. P-Rank is a general similarity framework, and many structural similarity measures, including SimRank, are its special cases (explained in detail later).

P-Rank Formula
The structural similarity between vertex a and vertex b (a ≠ b), s(a, b):
- Recursive form, combining an in-link similarity term and an out-link similarity term
- Approximate iterative form
Notes: The recursive form is a linear combination of the similarity computed from both sides. Lambda is a parameter controlling the relative weight of the in-link side and the out-link side. C is a damping factor. I(a) is the set of all nodes with a link to a, and O(a) is the set of all nodes pointed to by a. In the iterative form, k is the iteration number: the similarity score of iteration k+1 is computed from the similarity scores of iteration k.
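The formula images are not part of this transcript. As a minimal sketch of the iterative form described in the notes, the code below assumes that each side contributes the average similarity over all pairs of neighbors, that the in-link term is weighted by lambda and the out-link term by 1 - lambda, and that a side with no neighbors contributes 0. The function name `p_rank`, its parameters, and the default values are illustrative choices, not taken from the slides.

```python
from itertools import product


def p_rank(edges, C=0.8, lam=0.5, iterations=5):
    """Naive iterative P-Rank sketch over directed (source, target) edges.

    Assumed update for a != b:
        s_{k+1}(a, b) = C * ( lam       * avg of s_k over I(a) x I(b)
                            + (1 - lam) * avg of s_k over O(a) x O(b) )
    and s_k(a, a) = 1 for every k.
    """
    nodes = sorted({v for edge in edges for v in edge})
    in_nb = {v: [] for v in nodes}   # I(v): nodes with a link to v
    out_nb = {v: [] for v in nodes}  # O(v): nodes that v points to
    for u, v in edges:
        out_nb[u].append(v)
        in_nb[v].append(u)

    # Iteration 0: every node is similar only to itself.
    s = {(a, b): 1.0 if a == b else 0.0 for a, b in product(nodes, repeat=2)}

    def avg(prev, xs, ys):
        # Average previous-iteration similarity over all neighbor pairs;
        # an empty neighbor set contributes nothing (an assumption).
        if not xs or not ys:
            return 0.0
        return sum(prev[x, y] for x in xs for y in ys) / (len(xs) * len(ys))

    for _ in range(iterations):
        nxt = {}
        for a, b in product(nodes, repeat=2):
            if a == b:
                nxt[a, b] = 1.0
            else:
                in_part = avg(s, in_nb[a], in_nb[b])
                out_part = avg(s, out_nb[a], out_nb[b])
                nxt[a, b] = C * (lam * in_part + (1 - lam) * out_part)
        s = nxt
    return s
```

Under these assumptions, setting `lam=1.0` uses only the in-link term, which corresponds to the SimRank special case mentioned on the previous slide. For two nodes that share an out-neighbor but have no in-links, that setting yields 0, while `lam=0.5` yields a positive score.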

P-Rank Property
The iterative P-Rank has the following properties:
- Symmetry: s_k(a, b) = s_k(b, a)
- Monotonicity: 0 ≤ s_k(a, b) ≤ s_{k+1}(a, b) ≤ 1
- Existence: the solution to the iterative P-Rank formula always exists and converges to a fixed point, s(∗, ∗), which is the theoretical solution to the recursive P-Rank formula
- Uniqueness: the solution to the iterative P-Rank formula is unique when C ≠ 1
The theoretical solution to P-Rank can be reached by repeated computation via the iterative form
Notes: These properties ensure that the iterative computation approaches the theoretical value. In practice convergence is fast; the number of iterations is usually around 5 or 6.

P-Rank Derivatives
P-Rank proposes a unified structural similarity framework, of which many structural similarity measures are special cases
Notes:
- Co-citation is a structural similarity measure. Intuitively, for a and b, co-citation is the number of nodes that point to both a and b. It is the one-step P-Rank considering only the in-link direction.
- Coupling is the dual of co-citation. It considers only the out-link direction and can be regarded as the one-step P-Rank considering the out-link direction only.
- Amsler considers both in-links and out-links, but only direct neighbors, without propagating similarity scores through the whole network. It is still a one-step P-Rank.
- SimRank uses recursive computation, but only the in-link direction is considered.
- rvs-SimRank is the dual of SimRank; only the out-link direction is considered.
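To make the one-step measures concrete, here is a small sketch that counts co-citation and coupling from an edge list. The helper names are hypothetical. The function `one_step_in_link_score` additionally assumes the averaged, C-damped update over in-neighbor pairs, starting from a similarity of 1 on identical nodes and 0 elsewhere; that normalization is an assumption, since the formula images are not in the transcript.

```python
def neighbor_sets(edges):
    """Build I(v) and O(v) as sets from directed (source, target) edges."""
    in_nb, out_nb = {}, {}
    for u, v in edges:
        out_nb.setdefault(u, set()).add(v)
        in_nb.setdefault(v, set()).add(u)
    return in_nb, out_nb


def cocitation(edges, a, b):
    """Number of nodes that point to both a and b."""
    in_nb, _ = neighbor_sets(edges)
    return len(in_nb.get(a, set()) & in_nb.get(b, set()))


def coupling(edges, a, b):
    """Number of nodes that both a and b point to."""
    _, out_nb = neighbor_sets(edges)
    return len(out_nb.get(a, set()) & out_nb.get(b, set()))


def one_step_in_link_score(edges, a, b, C=0.8):
    """One in-link-only update step (assumed normalization):
    C * |I(a) & I(b)| / (|I(a)| * |I(b)|)."""
    in_nb, _ = neighbor_sets(edges)
    ia, ib = in_nb.get(a, set()), in_nb.get(b, set())
    if not ia or not ib:
        return 0.0
    return C * len(ia & ib) / (len(ia) * len(ib))
```

For example, with edges r1→a, r1→b, r2→a, a→t and b→t, the pair (a, b) has one common citing node (r1) and one common cited node (t).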

P-Rank Computation
An iterative algorithm is executed until it reaches the fixed point
- Space complexity: O(n^2)
- Time complexity: O(n^4), can be improved to O(n^3) by amortization
Approximation algorithms for different IN scenarios
- Homogeneous IN, radius-based pruning: vertex pairs beyond a radius of r are no longer considered in similarity computation
- Heterogeneous IN, category-based pruning: vertex pairs in different categories are no longer considered in similarity computation
Notes: n is the number of nodes in the graph. The amortization algorithm was published in VLDB'08. The approximation algorithms are straightforward.
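The two pruning ideas can be sketched as filters that decide which vertex pairs are kept for similarity computation. This is an illustration only: the slides do not say how the radius is measured, so the sketch assumes hop distance with link direction ignored, and both function names are hypothetical.

```python
from collections import deque
from itertools import combinations


def same_category_pairs(category):
    """Category-based pruning: keep only pairs whose nodes share a category.

    `category` maps each node to its category label.
    """
    by_cat = {}
    for node, cat in category.items():
        by_cat.setdefault(cat, []).append(node)
    return {
        pair
        for members in by_cat.values()
        for pair in combinations(sorted(members), 2)
    }


def pairs_within_radius(edges, r):
    """Radius-based pruning: keep only pairs at most r hops apart.

    Assumption: distance is counted in hops, ignoring link direction.
    """
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    pairs = set()
    for start in adj:
        dist = {start: 0}
        queue = deque([start])
        while queue:  # breadth-first search, cut off at depth r
            node = queue.popleft()
            if dist[node] == r:
                continue
            for nb in adj[node]:
                if nb not in dist:
                    dist[nb] = dist[node] + 1
                    queue.append(nb)
        pairs.update(
            tuple(sorted((start, other))) for other in dist if other != start
        )
    return pairs
```

Either filter shrinks the set of vertex pairs that an iterative computation has to update, which is where the savings over the full n × n pair table would come from.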

Experimental Studies
Data sets
- Heterogeneous IN: DBLP (paper, author, conference, year)
- Homogeneous IN: DBLP (papers with citations), synthetic data (R-MAT)
Methods
- P-Rank
- SimRank
Metrics
- Compactness of clusters
- Algorithmic nature
- Ground truth
Notes:
1. The compactness of clusters can be intuitively defined as intra-cluster distance / inter-cluster distance, so the smaller, the better.
2. Algorithmic nature tests different aspects of the P-Rank algorithm: how fast it converges and how it correlates with the parameters C and lambda.
3. Ground truth results are top-10 ranking results that use P-Rank as the underlying similarity measure in DBLP.

Compactness of Clusters
P-Rank and SimRank are each used as the underlying similarity measure, and K-Medoids is used to cluster the vertices
Compactness: intra-cluster distance / inter-cluster distance
Panels: Heterogeneous IN; Homogeneous IN
Notes: Here, Cp stands for "compactness for P-Rank" and Cs stands for "compactness for SimRank". The smaller the compactness, the better the clustering results.
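The compactness ratio can be sketched as follows. The slides only give the ratio itself, so this sketch assumes that both distances are means over vertex pairs and that the caller supplies a pairwise distance (for instance 1 minus the similarity score). The function name and the input layout are hypothetical.

```python
from itertools import combinations


def compactness(distance, labels):
    """Mean intra-cluster distance divided by mean inter-cluster distance.

    `labels` maps node -> cluster id; `distance` maps a sorted node pair
    (a, b) with a < b to a non-negative distance. Smaller is better.
    """
    intra, inter = [], []
    for a, b in combinations(sorted(labels), 2):
        d = distance[a, b]
        # Same cluster -> intra-cluster pair, otherwise inter-cluster pair.
        (intra if labels[a] == labels[b] else inter).append(d)
    if not intra or not inter:
        raise ValueError("need both intra-cluster and inter-cluster pairs")
    return (sum(intra) / len(intra)) / (sum(inter) / len(inter))
```

Computing this value once with P-Rank-based distances and once with SimRank-based distances on the same clustering setup would give the Cp and Cs quantities compared on this slide.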

Algorithmic Nature
Iterative P-Rank converges fast to the fixed point
Panels: P-Rank vs. the damping factor C; P-Rank vs. lambda
Notes: These two diagrams show that the P-Rank algorithm converges quickly. In the first diagram, a small C converges faster, while a large C needs more iterations. The reason is that the larger C is, the smaller the loss of similarity belief during similarity propagation. The second diagram shows the relative importance of the similarity scores from the in-link and out-link sides. In this experiment, the out-link side plays the larger role in similarity computation.

Ground Truth Ranking Result
Top-10 ranking results for author vertices in DBLP by P-Rank
Notes: These results are quite intuitive. The top table shows the most similar pairs, and the bottom two are top-10 rankings for different queries.

Conclusion
The proliferation of information networks calls for effective structural similarity measures in
- Ranking
- Clustering
- Top-k query processing
- ...
Compared with SimRank, P-Rank is shown to be a more effective structural similarity measure in large information networks
- Semantically complete, general, robust, and flexible enough to be employed in different IN settings

Thank you