RankClus: Integrating Clustering with Ranking for Heterogeneous Information Network Analysis
Yizhou Sun, Jiawei Han, Peixiang Zhao, Zhijun Yin, Hong Cheng, Tianyi Wu
Department of Computer Science, University of Illinois at Urbana-Champaign
EDBT'09, St. Petersburg, Russia, March 2009
11/17/2018
Outline
- Background
- Motivation
- The RankClus Algorithm
- Experiments
- Conclusion
Information Networks Are Ubiquitous
- Figures: Conference-Author Network, Co-author Network
Two Kinds of Information Networks
- Homogeneous network
  - Objects belong to a single type
  - E.g., co-author network, the Internet, friendship network, gene interaction network
  - Most current studies are on homogeneous networks
- Heterogeneous network
  - Objects belong to several types
  - E.g., conference-author network, paper-conference-author-topic network, movie-user network, webpage-tag-user network
  - Most real networks are heterogeneous, and many homogeneous networks are extracted from a more complex network
How to Better Understand Information Networks?
- Problem: large, raw networks are hard to understand
  - Huge number of objects
  - Links are "in a mess"
- Solution: extract aggregate information from networks
  - Ranking
  - Clustering
Ranking
- Goal
  - Evaluate the importance of objects in the network
  - A ranking function maps each object to a non-negative real score
- Algorithms
  - PageRank (for homogeneous networks)
  - HITS (for homogeneous networks)
  - PopRank (for heterogeneous networks)
Clustering
- Goal
  - Group similar objects together and obtain a cluster label for each object
- Algorithms
  - Spectral clustering: Min-Cut, N-Cut, and MinMax-Cut (for homogeneous networks)
  - Density-based clustering: SCAN (for homogeneous networks)
- How to cluster heterogeneous networks?
  - Use SimRank to first extract pair-wise similarity for target objects (but the time complexity is high)
  - Then combine it with spectral clustering
Why RankClus?
- More meaningful clusters
  - Within each cluster, a ranking score for every object is available as well
- More meaningful ranking
  - Ranking within a cluster is more meaningful than ranking in the whole network
- Addresses the problem of clustering in heterogeneous networks
  - No need to compute pair-wise similarity of objects
  - Each object is mapped into a low-dimensional measure space
Global Ranking vs. Within-Cluster Ranking in a Toy Example
- Two areas: 10 conferences and 100 authors in each area
Difficulties in Clustering Heterogeneous Networks
- What type of objects should be clustered?
  - Clustering is done on one specific type of objects (called target objects), specified by the user
  - Clustering of target objects can induce a sub-network of the original network
- How to make clustering efficient?
  - How to avoid calculating pair-wise similarities among target objects?
Algorithm Framework - Illustration
- Figure: Sub-Network, Ranking, Clustering
Algorithm Framework - Philosophy
- Ranking and clustering can mutually improve each other
  - Ranking: once a cluster becomes more accurate, the ranking within it becomes more reasonable and serves as the distinguishing feature of the cluster
  - Clustering: once the rankings of different clusters become more distinct from each other, the clusters can be adjusted to give more accurate results
- Objects preserve similarity under the new measure space
  - E.g., consider VLDB and SIGMOD
Algorithm Framework - Summary
- Step 0. Initialization
  - Randomly partition the target objects into K clusters
- Step 1. Ranking
  - Rank the objects in the sub-network induced from each cluster; the ranking serves as the feature of that cluster
- Step 2. Generating the new measure space
  - Estimate the mixture model coefficients for each target object
- Step 3. Adjusting clusters
- Step 4. Repeat Steps 1-3 until stable
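The four steps above can be sketched end to end as a short loop. This is only a minimal illustration under simplifying assumptions, not the authors' implementation: it uses degree-based simple ranking for Step 1 and the naive cosine mapping instead of the mixture model for Step 2. The names `rankclus_sketch`, `w_xy`, `init_labels`, `max_iter`, and `seed` are hypothetical, and the handling of empty clusters is an assumption.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def rankclus_sketch(w_xy, K, init_labels=None, max_iter=20, seed=0):
    """w_xy[i][j] is the link weight between target object i and attribute object j."""
    n_x, n_y = len(w_xy), len(w_xy[0])
    rng = random.Random(seed)
    # Step 0: random partition, unless the caller supplies a starting one.
    if init_labels is not None:
        labels = list(init_labels)
    else:
        labels = [rng.randrange(K) for _ in range(n_x)]
    for _ in range(max_iter):
        # Step 1: rank attribute objects inside each cluster's sub-network
        # (simple ranking: proportional to link weight within the cluster).
        rankings = []
        for k in range(K):
            counts = [sum(w_xy[i][j] for i in range(n_x) if labels[i] == k)
                      for j in range(n_y)]
            total = sum(counts)
            # Assumption: an empty cluster falls back to a uniform ranking.
            rankings.append([c / total for c in counts] if total > 0
                            else [1.0 / n_y] * n_y)
        # Step 2: map every target object to a K-dimensional vector
        # (naive variant: cosine between its links and each cluster ranking).
        coords = [[cosine(row, r) for r in rankings] for row in w_xy]
        # Step 3: recompute centers and reassign by 1 - cosine similarity.
        centers = []
        for k in range(K):
            members = [c for c, lab in zip(coords, labels) if lab == k]
            centers.append([sum(col) / len(members) for col in zip(*members)]
                           if members else None)
        new_labels = []
        for c in coords:
            dists = [1.0 - cosine(c, ctr) if ctr is not None else float("inf")
                     for ctr in centers]
            new_labels.append(dists.index(min(dists)))
        # Step 4: stop once the assignment no longer changes.
        if new_labels == labels:
            break
        labels = new_labels
    return labels
```

On a tiny two-area network such as `[[4, 3, 0, 0], [3, 4, 0, 0], [0, 0, 4, 3], [0, 0, 3, 4]]`, starting from the partition `[0, 1, 1, 1]`, the loop moves the second conference into the first cluster and then stops.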
Focus on a Bi-type Network Case
- Conference-author network; links can exist between
  - Conference (X) and author (Y)
  - Author (Y) and author (Y)
- Use W to denote the links and their weights, in block form: W = [W_XX, W_XY; W_YX, W_YY]
Step 1: Feature Extraction - Ranking
- Simple Ranking
  - Proportional to degree counting for objects
  - E.g., number of publications of authors
  - Considers only the immediate neighborhood in the network
- Authority Ranking
  - Extension of HITS to a weighted bi-type network
  - Rules:
    - Rule 1: Highly ranked authors publish many papers in highly ranked conferences
    - Rule 2: Highly ranked conferences attract many papers from many highly ranked authors
    - Rule 3: The rank of an author is enhanced if he or she co-authors with many authors or many highly ranked authors
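Simple ranking as described above amounts to normalized degree counting. A minimal sketch, assuming the conference-author links are given as a dense weight matrix; the names `simple_ranking` and `w_xy` are hypothetical:

```python
def simple_ranking(w_xy):
    """Rank both object types by their share of the total link weight.

    w_xy[i][j] is the number of papers author j published in conference i.
    Returns (r_x, r_y); each list sums to 1.
    """
    total = sum(sum(row) for row in w_xy)
    # Conference score: its row sum (weighted degree) over the total weight.
    r_x = [sum(row) / total for row in w_xy]
    # Author score: its column sum (weighted degree) over the total weight.
    r_y = [sum(col) / total for col in zip(*w_xy)]
    return r_x, r_y
```

For example, `simple_ranking([[2, 1, 0], [0, 1, 4]])` gives conference scores 3/8 and 5/8 and author scores 2/8, 2/8, and 4/8.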
Philosophy in Authority Ranking
- Ranking scores are propagated iteratively using Rules 2 and 1, or Rules 2 and 3
- The authority rankings of X and Y turn out to be the primary eigenvectors of certain symmetric matrices
- Considers the impact of the overall network
- Should be better than simple ranking
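The iterative propagation can be sketched as a power-iteration-style loop. This is a hedged illustration: the update order, the sum-to-one normalization, and the mixing weight `alpha` between the conference term (Rule 1) and the co-author term (Rule 3) are assumptions made for the example, and the names `authority_ranking`, `w_xy`, `w_yy`, and `iters` are hypothetical.

```python
def normalize(v):
    """Scale a non-negative vector so that it sums to 1 (no-op for all zeros)."""
    s = sum(v)
    return [x / s for x in v] if s > 0 else v


def authority_ranking(w_xy, w_yy, alpha=0.95, iters=50):
    """w_xy[i][j]: papers by author j in conference i; w_yy[j][l]: co-author weight."""
    n_x, n_y = len(w_xy), len(w_xy[0])
    r_x = [1.0 / n_x] * n_x
    r_y = [1.0 / n_y] * n_y
    for _ in range(iters):
        # Rule 2: a conference scores high if highly ranked authors publish in it.
        r_x = normalize([sum(w_xy[i][j] * r_y[j] for j in range(n_y))
                         for i in range(n_x)])
        # Rule 1: an author scores high by publishing in highly ranked conferences.
        from_conf = [sum(w_xy[i][j] * r_x[i] for i in range(n_x))
                     for j in range(n_y)]
        # Rule 3: an author's score is boosted by highly ranked co-authors.
        from_coauthors = [sum(w_yy[j][l] * r_y[l] for l in range(n_y))
                          for j in range(n_y)]
        r_y = normalize([alpha * a + (1 - alpha) * b
                         for a, b in zip(from_conf, from_coauthors)])
    return r_x, r_y
```

Setting `alpha=1.0` drops the co-author term, which corresponds to iterating Rules 2 and 1 only.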
Example: Authority Ranking in the 2-Area Conference-Author Network
- Given the correct clusters, the author rankings of the two clusters are quite distinct from each other
Step 2: Generate New Measure Space - A Naive Method
- Map each target object directly to a K-dimensional vector by considering the sub-network induced by it
  - Compare r(Y|x) with r(Y|k)
  - Cosine similarity or KL-divergence can be used
  - E.g., (cos(r(Y|x), r(Y|1)), …, cos(r(Y|x), r(Y|K)))
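The cosine variant of the naive mapping can be written in a few lines. A minimal sketch, assuming r(Y|x) and each r(Y|k) are already available as equal-length lists of scores; the function names are hypothetical:

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def naive_coordinates(r_y_given_x, cluster_rankings):
    """Return (cos(r(Y|x), r(Y|1)), ..., cos(r(Y|x), r(Y|K))) as a list."""
    return [cosine(r_y_given_x, r_k) for r_k in cluster_rankings]
```

A target object whose induced ranking matches cluster 1 and shares no authors with cluster 2 is mapped to approximately (1, 0).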
Step 2: Generate New Measure Space - A Mixture Model Method
- Treat each target object's links as generated under a mixture of the ranking distributions of the clusters
  - Consider ranking as a distribution: r(Y) → p(Y)
  - Each target object xi is mapped into a K-vector (πi,k)
  - Parameters are estimated using the EM algorithm
    - Maximize the log-likelihood given all the observed links
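The EM estimation of the coefficients πi,k can be illustrated with a simplified per-object mixture in which the within-cluster ranking distributions are held fixed and only the mixing weights are learned. This is an assumption-laden sketch rather than the exact formulation behind the slides; the names `em_mixture_coefficients`, `w_xy`, `p_y_given_k`, and `iters` are hypothetical.

```python
def em_mixture_coefficients(w_xy, p_y_given_k, iters=20):
    """Estimate pi[i][k], the weight of cluster k in target object i's links.

    w_xy[i][j]: weight of the link between target i and attribute object j.
    p_y_given_k[k][j]: ranking of attribute object j in cluster k, treated as
    a probability distribution over attribute objects.
    """
    n_x, K = len(w_xy), len(p_y_given_k)
    pi = [[1.0 / K] * K for _ in range(n_x)]
    for _ in range(iters):
        for i, links in enumerate(w_xy):
            expected = [0.0] * K
            for j, w in enumerate(links):
                if w == 0:
                    continue
                # E-step: posterior that link (i, j) was generated by cluster k.
                joint = [pi[i][k] * p_y_given_k[k][j] for k in range(K)]
                z = sum(joint)
                if z == 0:
                    continue  # no cluster explains this link; skip it
                for k in range(K):
                    expected[k] += w * joint[k] / z
            # M-step: new coefficients are the normalized expected link mass.
            total = sum(expected)
            if total > 0:
                pi[i] = [e / total for e in expected]
    return pi
```

Because zero entries in a cluster's ranking make some links unexplainable by that cluster, a practical version would probably smooth the distributions first; that step is omitted here.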
Example: 2-D Coefficients in the 2-Area Conference-Author Network
- The conferences are well separated in the new measure space
Step 3: Cluster Adjustment in New Measure Space
- Cluster center in the new measure space
  - Vector mean of the objects in the cluster (K-dimensional)
- Cluster adjustment
  - Distance measure: 1 - cosine similarity
  - Assign each object to the cluster with the nearest center
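The adjustment step can be sketched directly from the two bullets above. How to treat a cluster that becomes empty is not stated on the slide, so the sketch simply makes such a cluster unreachable; that choice and the names `adjust_clusters`, `coords`, and `labels` are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def adjust_clusters(coords, labels, K):
    """Reassign each object to the cluster whose center is nearest.

    coords[i] is object i's K-dimensional vector in the new measure space;
    labels[i] is its current cluster. Distance is 1 - cosine similarity.
    """
    centers = []
    for k in range(K):
        members = [c for c, lab in zip(coords, labels) if lab == k]
        # Center = component-wise mean of the member vectors.
        centers.append([sum(col) / len(members) for col in zip(*members)]
                       if members else None)
    new_labels = []
    for c in coords:
        # Assumption: an empty cluster gets infinite distance.
        dists = [1.0 - cosine(c, ctr) if ctr is not None else float("inf")
                 for ctr in centers]
        new_labels.append(dists.index(min(dists)))
    return new_labels
```

For instance, with vectors (0.9, 0.1), (0.8, 0.2), (0.1, 0.9), (0.6, 0.4) and current labels 0, 0, 1, 1, the last object is closer in angle to the first cluster's center and is moved there.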
A Running Case Illustration for 2-Area Conf-Author Network
- Initially, ranking distributions are mixed together
- Two clusters of objects are mixed together, but preserve similarity somehow
- Improved a little
- Two clusters are almost well separated
- Improved significantly
- Well separated
- Stable
Ranking Function Analysis
- Why is "Authority Ranking" better than "Simple Ranking"?
- For authority ranking, each object's score is determined by
  - The number of objects linking to it
  - The strength of these links (link weight)
  - The quality of these objects (their scores)
- For simple ranking, the quality of all linking objects is treated as equal
Re-examine the Rules
- Rule 1: Highly ranked authors publish many papers in highly ranked conferences
  - An author publishing many papers in junk conferences will be ranked low
- Rule 2: Highly ranked conferences attract many papers from many highly ranked authors
  - A conference accepting most of its papers from lowly ranked authors will be ranked low
- Rule 3: The rank of an author is enhanced if he or she co-authors with many authors or many highly ranked authors
  - A highly ranked author in an area usually has many collaborations with others
Why Does a Better Ranking Function Derive Better Clustering?
- Consider the measure space generation process
  - For the naive method, highly ranked objects in a cluster play a more important role in deciding a target object's new measure
  - For the mixture model, the same holds
- Intuitively, finding the highly ranked objects in a cluster is equivalent to finding the right cluster
Time Complexity Analysis
- Notation: |E| is the number of edges in the network, m the number of target objects, K the number of clusters
- At each iteration
  - Ranking for a sparse network: ~O(|E|)
  - Mixture model estimation: ~O(K|E| + mK)
  - Cluster adjustment: ~O(mK^2)
- In all, linear in |E|: ~O(K|E|)
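Adding the three per-iteration costs makes the overall bound explicit. The last simplification below relies on an assumption not stated on the slide, namely that mK grows no faster than |E| (for example, when every target object has at least about K links):

```latex
\begin{align*}
\underbrace{O(|E|)}_{\text{ranking}}
+ \underbrace{O(K|E| + mK)}_{\text{mixture model}}
+ \underbrace{O(mK^2)}_{\text{cluster adjustment}}
&= O(K|E| + mK^2) \\
&= O(K|E|) \quad \text{if } mK = O(|E|).
\end{align*}
```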
Case Study
- Dataset: DBLP
  - All 2676 conferences and the 20,000 authors with the most publications, from 1998 to 2007
  - Both conference-author relationships and co-author relationships are used
  - K = 15
Accuracy Study
- Dataset: synthetic
  - Simulates a bipartite network similar to the conf-author network
  - P: controls the number of attribute objects
  - T: transition probability matrix, controls the overlap between clusters
  - K: fixed to 3
- Generating parameters for the five synthetic datasets
  - Data1 (medium separated, medium density): P = [1000, 1500, 2000], T = [0.8, 0.05, 0.15; 0.1, 0.8, 0.1; 0.1, 0.05, 0.85]
  - Data2 (medium separated, low density): P = [800, 1300, 1200], T = [0.8, 0.05, 0.15; 0.1, 0.8, 0.1; 0.1, 0.05, 0.85]
  - Data3 (medium separated, high density): P = [2000, 3000, 4000]
  - Data4 (highly separated, medium density): T = [0.9, 0.05, 0.05; 0.05, 0.9, 0.05; 0.1, 0.05, 0.85]
  - Data5 (poorly separated, medium density): T = [0.7, 0.15, 0.15; 0.15, 0.7, 0.15; 0.15, 0.15, 0.7]
Accuracy Study (Cont.)
- 5 (synthetic) dataset settings, 4 methods
- For each setting, generate 10 datasets and run each method 100 times on each dataset
- RankClus with authority ranking is the best overall
Efficiency Study
- Varying the size of the attribute type of objects (×2)
Conclusions
- A general framework is proposed in which ranking and clustering are successfully combined to analyze information networks
- A formal study of how ranking and clustering can mutually reinforce each other in information network analysis
- A novel algorithm, RankClus, is proposed, and its correctness and effectiveness are verified
- A thorough experimental study on both synthetic and real datasets, in comparison with state-of-the-art algorithms, demonstrates the accuracy and efficiency of RankClus