RankClus: Integrating Clustering with Ranking for Heterogeneous Information Network Analysis
Yizhou Sun, Jiawei Han, Peixiang Zhao, Zhijun Yin, Hong Cheng, Tianyi Wu. Department of Computer Science, University of Illinois at Urbana-Champaign. EDBT'09, St.-Petersburg, Russia, March 2009
Outline: Background, Motivation, The RankClus Algorithm, Experiments, Conclusion
Information Networks Are Ubiquitous
Examples: a conference-author network; a co-author network.
Two Kinds of Information Networks
Homogeneous network: objects belong to a single type, e.g., a co-author network, the Internet, a friendship network, or a gene-interaction network. Most current studies are on homogeneous networks.
Heterogeneous network: objects belong to several types, e.g., a conference-author network, a paper-conference-author-topic network, a movie-user network, or a webpage-tag-user network. Most real networks are heterogeneous, and many homogeneous networks are extracted from a more complex heterogeneous network.
How to Better Understand Information Networks?
Problem: large, raw networks are hard to understand; they contain a huge number of objects, and the links are "in a mess." Solution: extract aggregate information from the network through ranking and clustering.
Ranking
Goal: evaluate the importance of the objects in the network. A ranking function maps each object to a non-negative real score. Algorithms: PageRank and HITS (for homogeneous networks), PopRank (for heterogeneous networks).
Clustering
Goal: group similar objects together and obtain a cluster label for each object. Algorithms: spectral clustering, such as Min-Cut, N-Cut, and MinMax-Cut (for homogeneous networks); density-based clustering, such as SCAN (for homogeneous networks). How to cluster heterogeneous networks? One option is to first extract pair-wise similarities between target objects with SimRank and then apply spectral clustering, but SimRank's time complexity is high.
Outline: Background, Motivation, The RankClus Algorithm, Experiments, Conclusion
Why RankClus?
More meaningful clusters: within each cluster, a ranking score for every object is available as well. More meaningful ranking: ranking within a cluster is more meaningful than ranking over the whole network. RankClus addresses clustering in heterogeneous networks without computing pair-wise similarities between objects: each object is instead mapped into a low-dimensional measure space.
Global Ranking vs. Within-Cluster Ranking in a Toy Example
Two areas, with 10 conferences and 100 authors in each area.
Difficulties in Clustering Heterogeneous Networks
Which type of objects should be clustered? Clustering is performed on one specific type of objects (the target objects), specified by the user; a clustering of the target objects induces a sub-network of the original network for each cluster. Efficiency is also a concern: how can we avoid calculating pair-wise similarities among the target objects?
Outline: Background, Motivation, The RankClus Algorithm, Experiments, Conclusion
Algorithm Framework - Illustration
[Figure: sub-network, ranking, clustering]
Algorithm Framework—Philosophy
Ranking and clustering can mutually improve each other. Ranking: once a cluster becomes more accurate, the ranking within it becomes more reasonable and serves as the distinguishing feature of that cluster. Clustering: once the clusters' rankings become more distinguished from each other, the clusters can be adjusted to give more accurate results. Objects also preserve their similarity under the new measure space; e.g., consider VLDB and SIGMOD.
Algorithm Framework - Summary
Step 0. Initialization: randomly partition the target objects into K clusters.
Step 1. Ranking: rank the sub-network induced from each cluster; the ranking serves as the cluster's feature.
Step 2. Generating the new measure space: estimate mixture-model coefficients for each target object.
Step 3. Adjusting the clusters.
Step 4. Repeat Steps 1-3 until stable.
A minimal end-to-end sketch of this loop follows.
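The sketch below deliberately uses the simplest choices for Steps 1 and 2 (degree-based simple ranking and the naive cosine measure space); the paper's full algorithm uses authority ranking and an EM-estimated mixture model, sketched on later slides. Function and parameter names here are illustrative, not from the paper.

```python
import numpy as np

def l1(v):
    """Normalize a nonnegative vector into a distribution."""
    s = v.sum()
    return v / s if s > 0 else np.full(len(v), 1.0 / len(v))

def rankclus_sketch(W_XY, K, n_iter=50, seed=0):
    """Steps 0-4 with the simplest choices: degree-based (simple) ranking
    as the cluster feature and the naive cosine measure space."""
    rng = np.random.default_rng(seed)
    m = W_XY.shape[0]
    labels = rng.integers(K, size=m)                      # Step 0
    norm = lambda M: M / (np.linalg.norm(M, axis=1, keepdims=True) + 1e-12)
    for _ in range(n_iter):
        # Step 1: within-cluster ranking over Y = each cluster's feature
        feats = np.vstack([l1(W_XY[labels == k].sum(axis=0).astype(float))
                           for k in range(K)])
        # Step 2: map each target object into a K-dim measure space
        rows = W_XY / np.maximum(W_XY.sum(axis=1, keepdims=True), 1e-12)
        coords = norm(rows) @ norm(feats).T               # cosine similarities
        # Step 3: reassign each object to the nearest cluster center
        centers = np.vstack([coords[labels == k].mean(axis=0)
                             if np.any(labels == k) else rng.random(K)
                             for k in range(K)])
        new_labels = (norm(coords) @ norm(centers).T).argmax(axis=1)
        if np.array_equal(new_labels, labels):            # Step 4: stable
            break
        labels = new_labels
    return labels
```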
Focus on a Bi-type Network Case
Consider the conference-author network, where links can exist between conferences (X) and authors (Y), and between authors (Y) and authors (Y). Use W to denote the links and their weights:
W = [ 0, W_XY; W_YX, W_YY ]
where the X-X block is zero because conferences do not link to each other directly, and W_YX is the transpose of W_XY.
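As a concrete illustration, a toy construction of W in NumPy; the counts below are made-up numbers, not data from the paper.

```python
import numpy as np

# Made-up toy counts: 2 conferences (X) and 3 authors (Y).
W_XY = np.array([[3, 1, 0],    # papers by each author in conference 1
                 [0, 2, 4]])   # papers by each author in conference 2
W_YY = np.array([[0, 1, 0],    # symmetric co-author counts among Y
                 [1, 0, 2],
                 [0, 2, 0]])
n_x = W_XY.shape[0]

# Full bi-type link matrix: the X-X block is zero because conferences
# never link to each other directly, and W_YX = W_XY transposed.
W = np.block([[np.zeros((n_x, n_x)), W_XY],
              [W_XY.T,               W_YY]])
```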
Step 1: Feature Extraction — Ranking
Simple ranking: proportional to an object's degree count, e.g., an author's number of publications; it considers only the immediate neighborhood in the network.
Authority ranking: an extension of HITS to the weighted bi-type network, based on three rules:
Rule 1: Highly ranked authors publish many papers in highly ranked conferences.
Rule 2: Highly ranked conferences attract many papers from many highly ranked authors.
Rule 3: The rank of an author is enhanced if he or she co-authors with many authors or with many highly ranked authors.
Philosophy in Authority Ranking
Ranking scores are propagated iteratively using Rules 2 and 1, or Rules 2 and 3; the authority rankings of X and Y turn out to be the primary eigenvectors of a certain symmetric matrix. Because authority ranking considers the impact of the overall network, it should be better than simple ranking. A sketch of the iteration is given below.
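A minimal sketch of the propagation, assuming the standard alternating update implied by the rules; the combination weight alpha and the iteration settings are assumptions, not values from the paper.

```python
import numpy as np

def authority_rank(W_XY, W_YY, alpha=0.95, n_iter=100, tol=1e-8):
    """Alternating propagation of authority scores on the bi-type network.

    Rule 2: a conference's score comes from the authors publishing in it;
    Rule 1: an author's score comes from the conferences they publish in;
    Rule 3: an author's score is smoothed over co-author links.
    """
    n_x, n_y = W_XY.shape
    r_x = np.full(n_x, 1.0 / n_x)
    r_y = np.full(n_y, 1.0 / n_y)
    for _ in range(n_iter):
        new_x = W_XY @ r_y                     # Rule 2
        new_x /= new_x.sum()
        new_y = alpha * (W_XY.T @ new_x) \
              + (1 - alpha) * (W_YY @ r_y)     # Rules 1 and 3
        new_y /= new_y.sum()
        done = np.abs(new_y - r_y).sum() < tol
        r_x, r_y = new_x, new_y
        if done:
            break
    return r_x, r_y
```

Normalizing after every update keeps this a power-iteration-style scheme, which is why the fixed point is an eigenvector, as the slide notes.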
Example: Authority Ranking in the 2-Area Conference-Author Network
Given the correct clusters, the rankings of the authors in the two areas are quite distinct from each other.
Step 2: Generate the New Measure Space—A Naive Method
Map each target object x to a K-dimensional vector directly, by comparing the ranking of the sub-network induced by x against each cluster's ranking: r(Y|x) vs. r(Y|k). Cosine similarity or KL-divergence can be used; e.g., x is mapped to (cos(r(Y|x), r(Y|1)), …, cos(r(Y|x), r(Y|K))).
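A sketch of the naive mapping, assuming r(Y|x) can be approximated by x's normalized link vector over Y (the paper derives r(Y|x) from the sub-network induced by x):

```python
import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

def naive_measure(W_XY, cluster_ranks):
    """Map each target object x to the K-vector
    (cos(r(Y|x), r(Y|1)), ..., cos(r(Y|x), r(Y|K))).

    cluster_ranks is a list of K within-cluster rankings r(Y|k).
    """
    coords = []
    for x_links in W_XY:
        r_y_given_x = x_links / (x_links.sum() + 1e-12)
        coords.append([cosine(r_y_given_x, r_k) for r_k in cluster_ranks])
    return np.array(coords)
```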
Step 2: Generate the New Measure Space—A Mixture Model Method
Consider each target object's links as generated under a mixture of the clusters' ranking distributions. Treating a ranking as a distribution, r(Y) → p(Y), each target object x_i is mapped into a K-vector of coefficients (π_{i,k}). The parameters are estimated with the EM algorithm by maximizing the log-likelihood of all observed links.
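A sketch of one plausible EM estimator for the coefficients, simplified to conference-author links only (the paper's model also covers co-author links); the updates are the standard mixture-of-multinomials EM, an assumption rather than the paper's exact derivation:

```python
import numpy as np

def estimate_pi(W_XY, p_k, n_em_iter=30):
    """EM for the mixture coefficients pi[i, k].

    Each target object x_i's links over Y are modeled as the mixture
    sum_k pi[i, k] * p_k[k], where p_k (shape K x n_y) holds the
    clusters' rank distributions over Y.
    """
    m, n_y = W_XY.shape
    K = p_k.shape[0]
    pi = np.full((m, K), 1.0 / K)
    for _ in range(n_em_iter):
        for i in range(m):
            # E-step: posterior that a link (x_i, y_j) came from cluster k
            post = pi[i][:, None] * p_k                  # K x n_y
            post /= post.sum(axis=0, keepdims=True) + 1e-12
            # M-step: pi[i, k] = weighted share of x_i's links from k
            counts = (post * W_XY[i]).sum(axis=1)
            pi[i] = counts / (counts.sum() + 1e-12)
    return pi
```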
Example: 2-D Coefficients in the 2-Area Conference-Author Network
The conferences are well separated in the new measure space.
Step 3: Cluster Adjustment in the New Measure Space
A cluster's center in the new measure space is the vector mean of the objects in the cluster (a K-dimensional vector). Cluster adjustment uses 1 - cosine similarity as the distance measure, and each object is assigned to the cluster with the nearest center.
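This step translates almost directly into code; a minimal sketch:

```python
import numpy as np

def adjust_clusters(coords, labels, K):
    """Recompute centers as vector means and reassign each object to the
    nearest center under 1 - cosine similarity (minimizing that distance
    is the same as maximizing cosine similarity).
    Assumes no cluster becomes empty."""
    centers = np.vstack([coords[labels == k].mean(axis=0) for k in range(K)])
    norm = lambda M: M / (np.linalg.norm(M, axis=1, keepdims=True) + 1e-12)
    sim = norm(coords) @ norm(centers).T      # m x K cosine similarities
    return sim.argmax(axis=1)
```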
A Running Case Illustration for the 2-Area Conf-Author Network
[Figure sequence: initially, the ranking distributions are mixed together and the two clusters of objects are mixed, though they somewhat preserve similarity; after one iteration the result improves a little and the two clusters are almost well separated; then it improves significantly and the clusters are well separated; finally the result is stable.]
Ranking Function Analysis
Why is authority ranking better than simple ranking? In authority ranking, each object's score is determined by the number of objects linking to it, the strength (weight) of those links, and the quality (score) of the linking objects. In simple ranking, each object's score is determined by link counts alone, treating the quality of all linking objects as equal.
Re-examining the Rules
Rule 1: Highly ranked authors publish many papers in highly ranked conferences, so an author publishing many papers in junk conferences will be ranked low.
Rule 2: Highly ranked conferences attract many papers from many highly ranked authors, so a conference accepting most of its papers from lowly ranked authors will be ranked low.
Rule 3: The rank of an author is enhanced by co-authoring with many authors or with many highly ranked authors; a highly ranked author in an area usually collaborates widely with others.
Why Does a Better Ranking Function Derive Better Clustering?
Consider the measure-space generation process. In the naive method, highly ranked objects in a cluster play a more important role in deciding a target object's new measure; the same holds for the mixture model. Intuitively, if we can find the highly ranked objects of a cluster, we have, equivalently, found the right cluster.
Time Complexity Analysis
Per iteration, with |E| edges in the network, m target objects, and K clusters: ranking on a sparse network is ~O(|E|); mixture-model estimation is ~O(K|E| + mK); cluster adjustment is ~O(mK^2). In all, the cost is linear in |E|: ~O(K|E|).
Outline: Background, Motivation, The RankClus Algorithm, Experiments, Conclusion
Case Study: The DBLP Dataset
All 2,676 conferences and the 20,000 authors with the most publications, over the period 1998-2007. Both conference-author relationships and co-author relationships are used, with K = 15.
Accuracy Study: Synthetic Datasets
Simulate bipartite networks similar to the conference-author network. P controls the numbers of attribute objects per cluster; T is a transition probability matrix that controls the overlap between clusters; K is fixed to 3. Generating parameters for the five synthetic datasets:
Data1 (medium separation, medium density): P = [1000, 1500, 2000], T = [0.8, 0.05, 0.15; 0.1, 0.8, 0.1; 0.1, 0.05, 0.85]
Data2 (medium separation, low density): P = [800, 1300, 1200], T = [0.8, 0.05, 0.15; 0.1, 0.8, 0.1; 0.1, 0.05, 0.85]
Data3 (medium separation, high density): P = [2000, 3000, 4000]
Data4 (high separation, medium density): T = [0.9, 0.05, 0.05; 0.05, 0.9, 0.05; 0.1, 0.05, 0.85]
Data5 (poor separation, medium density): T = [0.7, 0.15, 0.15; 0.15, 0.7, 0.15; 0.15, 0.15, 0.7]
A sketch of one plausible generator appears below.
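The slides do not spell the generator out, so the following sketch is one plausible reading of P and T; in particular, the interpretation of T's rows and the per-object link count are assumptions, not the paper's exact procedure.

```python
import numpy as np

def gen_bipartite(P, T, targets_per_cluster=10, links_per_obj=20, seed=0):
    """Generate a clustered bipartite link matrix W (targets x attributes).

    P[k] is the number of attribute objects in cluster k; row T[k] gives
    the probabilities that a cluster-k attribute object links into each
    target cluster.
    """
    rng = np.random.default_rng(seed)
    K = len(P)
    n_targets = K * targets_per_cluster
    W = np.zeros((n_targets, sum(P)), dtype=int)
    col = 0
    for k in range(K):
        for _ in range(P[k]):
            for _ in range(links_per_obj):
                j = rng.choice(K, p=T[k])                 # pick a target cluster
                t = j * targets_per_cluster + rng.integers(targets_per_cluster)
                W[t, col] += 1                            # add one link
            col += 1
    return W
```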
Accuracy Study (Cont.)
Five synthetic dataset settings and four methods. For each setting, generate 10 datasets and run each method 100 times on each dataset. RankClus with authority ranking is the best overall.
Efficiency Study
Varying the number of attribute-type objects (×2 each step). [Figure: running-time results]
Outline: Background, Motivation, The RankClus Algorithm, Experiments, Conclusions
Conclusions
A general framework is proposed in which ranking and clustering are successfully combined to analyze information networks. We formally study how ranking and clustering can mutually reinforce each other in information network analysis. A novel algorithm, RankClus, is proposed, and its correctness and effectiveness are verified. A thorough experimental study on both synthetic and real datasets, in comparison with state-of-the-art algorithms, demonstrates the accuracy and efficiency of RankClus.