RankClus: Integrating Clustering with Ranking for Heterogeneous Information Network Analysis
Yizhou Sun, Jiawei Han, Peixiang Zhao, Zhijun Yin, Hong Cheng, Tianyi Wu
Department of Computer Science, University of Illinois at Urbana-Champaign
EDBT'09, St. Petersburg, Russia, March 2009

Outline
- Background
- Motivation
- The RankClus Algorithm
- Experiments
- Conclusion

Information Networks Are Ubiquitous
(Figures: a conference-author network and a co-author network.)

Two Kinds of Information Networks
- Homogeneous networks
  - Objects belong to a single type
  - E.g., co-author networks, the Internet, friendship networks, gene interaction networks
  - Most existing studies target homogeneous networks
- Heterogeneous networks
  - Objects belong to several types
  - E.g., conference-author networks, paper-conference-author-topic networks, movie-user networks, webpage-tag-user networks
  - Most real networks are heterogeneous, and many homogeneous networks are extracted from a more complex heterogeneous network

How to Better Understand Information Networks?
- Problem: large, raw networks are hard to understand
  - Huge number of objects
  - Links are "in a mess"
- Solution: extract aggregate information from the network
  - Ranking
  - Clustering

Ranking
- Goal: evaluate the importance of objects in the network
  - A ranking function maps each object to a non-negative real score
- Algorithms
  - PageRank (for homogeneous networks; see the sketch below)
  - HITS (for homogeneous networks)
  - PopRank (for heterogeneous networks)
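As a concrete reference point, here is a minimal PageRank power-iteration sketch for a homogeneous network. This is background, not part of RankClus; the damping factor d = 0.85 and the tolerance are conventional defaults, not values from the slides.

```python
import numpy as np

def pagerank(A, d=0.85, tol=1e-8, max_iter=100):
    """Power iteration on a homogeneous network's adjacency matrix A."""
    n = A.shape[0]
    out_deg = A.sum(axis=1)
    out_deg[out_deg == 0] = 1.0          # sinks keep an all-zero row (fine for a sketch)
    P = A / out_deg[:, None]             # row-stochastic transition matrix
    r = np.full(n, 1.0 / n)              # uniform initial scores
    for _ in range(max_iter):
        r_new = (1 - d) / n + d * (P.T @ r)
        if np.abs(r_new - r).sum() < tol:
            break
        r = r_new
    return r
```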

Clustering
- Goal: group similar objects together and obtain a cluster label for each object
- Algorithms
  - Spectral clustering: Min-Cut, N-Cut, and MinMax-Cut (for homogeneous networks)
  - Density-based clustering: SCAN (for homogeneous networks)
- How to cluster heterogeneous networks?
  - One option: use SimRank to first extract pairwise similarities of target objects (see the sketch below), then apply spectral clustering; however, SimRank's time complexity is high
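For illustration, a naive matrix-form SimRank sketch; every iteration touches all n^2 node pairs, which is exactly the expense the slide warns about. The decay factor C = 0.8 and the iteration count are common defaults, not values from the slides.

```python
import numpy as np

def simrank(A, C=0.8, iters=5):
    """Naive SimRank: S = C * W^T S W with W the column-normalized adjacency."""
    n = A.shape[0]
    in_deg = A.sum(axis=0)
    in_deg[in_deg == 0] = 1.0
    W = A / in_deg                      # column-normalized adjacency
    S = np.eye(n)
    for _ in range(iters):
        S = C * (W.T @ S @ W)
        np.fill_diagonal(S, 1.0)        # s(a, a) = 1 by definition
    return S
```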

Outline
- Background
- Motivation
- The RankClus Algorithm
- Experiments
- Conclusion

Why RankClus?
- More meaningful clusters
  - Within each cluster, a ranking score for every object is available as well
- More meaningful ranking
  - Ranking within a cluster is more meaningful than ranking over the whole network
- Addresses the problem of clustering in heterogeneous networks
  - No need to compute pairwise similarities between objects
  - Instead, each object is mapped into a low-dimensional measure space

Global Ranking vs. Within-Cluster Ranking in a Toy Example
Two areas: 10 conferences and 100 authors in each area.

Difficulties in Clustering Heterogeneous Networks
- Which type of objects should be clustered?
  - Clustering is done on one specific type of objects (the target objects), specified by the user
  - A clustering of the target objects induces a sub-network of the original network
- How to cluster efficiently?
  - How can we avoid calculating pairwise similarities among target objects?

Outline
- Background
- Motivation
- The RankClus Algorithm
- Experiments
- Conclusion

Algorithm Framework - Illustration
(Figure: the iterative loop among sub-network extraction, ranking, and clustering.)

Algorithm Framework - Philosophy
- Ranking and clustering can mutually improve each other
  - Ranking: once a cluster becomes more accurate, the ranking within it becomes more reasonable and serves as the cluster's distinguishing feature
  - Clustering: once the cluster rankings become more distinct from one another, cluster memberships can be adjusted, yielding more accurate clusters
- Objects preserve their similarity under the new measure space
  - E.g., VLDB and SIGMOD, two database venues, map to similar vectors

Algorithm Framework - Summary
- Step 0. Initialization: randomly partition the target objects into K clusters
- Step 1. Ranking: rank each sub-network induced from each cluster; the conditional ranking serves as that cluster's feature
- Step 2. Generating a new measure space: estimate mixture-model coefficients for each target object
- Step 3. Adjusting clusters: reassign objects in the new measure space
- Step 4. Repeat Steps 1-3 until stable (a sketch of the loop follows)
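A high-level sketch of this loop as I read it from the slide. The three callables rank_fn, mixture_fn, and adjust_fn are hypothetical placeholders for Steps 1-3; concrete sketches of each appear with the later slides.

```python
import random

def rankclus(network, targets, K, rank_fn, mixture_fn, adjust_fn, max_iter=50):
    # Step 0: random initial partition of the target objects
    clusters = {x: random.randrange(K) for x in targets}
    for _ in range(max_iter):
        # Step 1: conditional ranking within each cluster's induced sub-network
        rankings = [rank_fn(network, clusters, k) for k in range(K)]
        # Step 2: map each target object to a K-dim vector of mixture coefficients
        coords = mixture_fn(network, targets, rankings)
        # Step 3: reassign each object to its nearest cluster center
        new_clusters = adjust_fn(coords, K)
        if new_clusters == clusters:     # Step 4: stop when the partition is stable
            break
        clusters = new_clusters
    return clusters, rankings
```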

Focus on a Bi-Type Network Case
- Conference-author network, where links can exist between:
  - Conferences (X) and authors (Y)
  - Authors (Y) and authors (Y)
- Use W to denote the links and their weights:
  W = [ 0      W_XY ]
      [ W_YX   W_YY ]
  where W_XY holds conference-author weights, W_YX = W_XY^T, W_YY holds co-author weights, and there are no conference-conference links.
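A small numpy sketch of this block structure, with illustrative sizes and random weights (nothing here comes from the DBLP data):

```python
import numpy as np

n_conf, n_auth = 3, 5
W_XY = np.random.randint(0, 4, size=(n_conf, n_auth))   # conference-author link weights
W_YY = np.random.randint(0, 2, size=(n_auth, n_auth))   # co-author link weights
W_YY = np.triu(W_YY, 1)
W_YY = W_YY + W_YY.T                                    # symmetric, zero diagonal

W = np.block([[np.zeros((n_conf, n_conf)), W_XY],       # no conference-conference links
              [W_XY.T,                     W_YY]])
```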

Step 1: Feature Extraction (Ranking)
- Simple ranking
  - Proportional to an object's degree, e.g., an author's number of publications (see the one-liner below)
  - Considers only the immediate neighborhood in the network
- Authority ranking
  - An extension of HITS to the weighted bi-type network
  - Rule 1: Highly ranked authors publish many papers in highly ranked conferences
  - Rule 2: Highly ranked conferences attract many papers from many highly ranked authors
  - Rule 3: The rank of an author is enhanced if he or she co-authors with many authors or many highly ranked authors
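Simple ranking amounts to normalized weighted degree; a short sketch makes the contrast with authority ranking explicit (W_XY as defined on the previous slide):

```python
import numpy as np

def simple_rank(W_XY):
    """Degree-based ranking: every link counts equally."""
    deg = W_XY.sum(axis=0)      # e.g., each author's total publication count
    return deg / deg.sum()
```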

Philosophy in Authority Ranking
- Ranking scores are propagated iteratively, using Rules 2 and 1, or Rules 2 and 3 (sketched below)
- The authority rankings of X and Y turn out to be the primary eigenvectors of a symmetric matrix
- Takes the impact of the overall network into account, so it should outperform simple ranking
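A hedged sketch of the propagation loop, alternating normalization of the two score vectors. The parameter alpha balancing conference-author against co-author propagation is my reading of Rules 1-3, not a formula quoted from the slides.

```python
import numpy as np

def authority_rank(W_XY, W_YY, alpha=0.9, iters=100, tol=1e-9):
    n_x, n_y = W_XY.shape
    r_x = np.full(n_x, 1.0 / n_x)
    r_y = np.full(n_y, 1.0 / n_y)
    for _ in range(iters):
        # Rule 2: conferences score high if highly ranked authors publish there
        new_x = W_XY @ r_y
        new_x /= new_x.sum()
        # Rule 1 + Rule 3: authors gain from good conferences and good co-authors
        new_y = alpha * (W_XY.T @ new_x) + (1 - alpha) * (W_YY @ r_y)
        new_y /= new_y.sum()
        if np.abs(new_y - r_y).sum() < tol and np.abs(new_x - r_x).sum() < tol:
            break
        r_x, r_y = new_x, new_y
    return r_x, r_y
```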

Example: Authority Ranking in the 2-Area Conference-Author Network
Given the correct clusters, the rankings of authors in the two areas are quite distinct from each other.

Step 2: Generate New Measure Space - A Naive Method
- Map each target object directly to a K-dimensional vector by considering the sub-network it induces
- Compare the object's conditional ranking r(Y|x) against each cluster ranking r(Y|k)
- Cosine similarity or KL-divergence can be used as the comparison measure
- E.g., x is mapped to (cos(r(Y|x), r(Y|1)), ..., cos(r(Y|x), r(Y|K))), as sketched below
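A minimal sketch of the naive mapping, assuming r(Y|x) and the cluster rankings r(Y|k) are given as vectors over the authors:

```python
import numpy as np

def cosine(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

def naive_measure(r_Y_given_x, cluster_rankings):
    # r_Y_given_x: length-|Y| vector; cluster_rankings: K vectors of length |Y|
    return np.array([cosine(r_Y_given_x, r_k) for r_k in cluster_rankings])
```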

Step 2: Generate New Measure Space - A Mixture Model Method
- Treat each target object's links as generated from a mixture of the per-cluster ranking distributions
- Interpret ranking as a distribution: r(Y) → p(Y)
- Each target object x_i is mapped into a K-dimensional vector of coefficients (π_{i,k})
- The parameters are estimated with the EM algorithm, maximizing the log-likelihood of all observed links (a minimal sketch follows)
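A minimal EM sketch for one target object's coefficients, treating each cluster ranking p_k(Y) as a fixed mixture component and the object's link weights as observations. This follows the generic mixture-model recipe; the paper's exact update equations may differ in detail.

```python
import numpy as np

def em_coefficients(w, P, iters=50):
    # w: length-|Y| link weights from target object x_i to the authors
    # P: K x |Y| matrix, row k = cluster-k ranking distribution p_k(Y)
    K = P.shape[0]
    pi = np.full(K, 1.0 / K)
    for _ in range(iters):
        # E-step: responsibility of each cluster k for each linked author j
        z = pi[:, None] * P                         # K x |Y|
        z /= z.sum(axis=0, keepdims=True) + 1e-12
        # M-step: re-estimate coefficients from link-weighted responsibilities
        pi = (z * w).sum(axis=1)
        pi /= pi.sum()
    return pi
```

Running this once per target object yields the m x K coefficient matrix consumed by Step 3.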

Example: 2-D Coefficients in the 2-Area Conference-Author Network
The conferences are well separated in the new measure space.

Step 3: Cluster Adjustment in the New Measure Space
- Cluster center in the new measure space: the vector mean of the cluster's objects (a K-dimensional vector)
- Cluster adjustment
  - Distance measure: 1 - cosine similarity
  - Assign each object to the cluster with the nearest center (see the sketch below)
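A sketch of the adjustment step under these definitions (assumes numpy arrays and that no cluster becomes empty):

```python
import numpy as np

def adjust(coords, assign, K):
    # coords: m x K matrix of mixture coefficients; assign: length-m label array
    centers = np.stack([coords[assign == k].mean(axis=0) for k in range(K)])
    norms = np.linalg.norm(coords, axis=1, keepdims=True) * \
            np.linalg.norm(centers, axis=1) + 1e-12
    sim = (coords @ centers.T) / norms          # m x K cosine similarities
    return sim.argmax(axis=1)                   # min (1 - cos) = max cos
```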

A Running Case Illustration for the 2-Area Conf-Author Network
(Figure panels across iterations:)
- Initially: ranking distributions are mixed together; the two clusters of objects are mixed as well, but similarity is somewhat preserved
- After one iteration: improved a little; the two clusters are almost well separated
- Later iterations: improved significantly; well separated
- Finally: stable

Ranking Function Analysis
Why is authority ranking better than simple ranking?
- Under authority ranking, each object's score is determined by:
  - The number of objects linking to it
  - The strength of those links (link weights)
  - The quality (scores) of those objects
- Under simple ranking, only the number and weight of links matter: the quality of all linking objects is treated as equal

Re-examine the Rules
- Rule 1: Highly ranked authors publish many papers in highly ranked conferences
  - So an author publishing many papers in junk conferences will still be ranked low
- Rule 2: Highly ranked conferences attract many papers from many highly ranked authors
  - So a conference accepting most of its papers from lowly ranked authors will be ranked low
- Rule 3: The rank of an author is enhanced if he or she co-authors with many authors or many highly ranked authors
  - A highly ranked author in an area usually collaborates with many others

Why Does a Better Ranking Function Yield Better Clustering?
- Consider the measure-space generation process
  - In the naive method, highly ranked objects in a cluster play the dominant role in determining a target object's new measure
  - The same holds for the mixture-model method
- Intuitively, finding the highly ranked objects in a cluster is equivalent to finding the right cluster

Time Complexity Analysis
Per iteration, with |E| edges in the network, m target objects, and K clusters:
- Ranking a sparse network: O(|E|)
- Mixture-model estimation: O(K|E| + mK)
- Cluster adjustment: O(mK^2)
Overall: linear in |E|, i.e., O(K|E|) per iteration.

Outline
- Background
- Motivation
- The RankClus Algorithm
- Experiments
- Conclusion

Case Study on the DBLP Dataset
- All 2,676 conferences and the 20,000 authors with the most publications, over the period 1998-2007
- Both conference-author and co-author relationships are used
- K = 15

Accuracy Study
- Dataset: synthetic, simulating a bipartite network similar to the conf-author network
  - P: controls the number of attribute objects in each cluster
  - T: transition probability matrix, controlling the overlap between clusters
  - K: fixed to 3
- Generating parameters for the five synthetic datasets (a generator sketch follows):
  - Data1 (medium separation, medium density): P = [1000, 1500, 2000], T = [0.8, 0.05, 0.15; 0.1, 0.8, 0.1; 0.1, 0.05, 0.85]
  - Data2 (medium separation, low density): P = [800, 1300, 1200], T = [0.8, 0.05, 0.15; 0.1, 0.8, 0.1; 0.1, 0.05, 0.85]
  - Data3 (medium separation, high density): P = [2000, 3000, 4000]
  - Data4 (high separation, medium density): T = [0.9, 0.05, 0.05; 0.05, 0.9, 0.05; 0.1, 0.05, 0.85]
  - Data5 (poor separation, medium density): T = [0.7, 0.15, 0.15; 0.15, 0.7, 0.15; 0.15, 0.15, 0.7]
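A hedged sketch of such a generator, assuming each target object draws a fixed number of links whose destination clusters follow T and whose endpoints are uniform within the destination cluster; the per-object link count and the number of target objects per cluster are illustrative assumptions, not values from the slides.

```python
import numpy as np

def generate(P, T, links_per_target=500, targets_per_cluster=10, seed=0):
    rng = np.random.default_rng(seed)
    K = len(P)
    offsets = np.concatenate([[0], np.cumsum(P)])   # attribute-object index ranges
    W = np.zeros((K * targets_per_cluster, offsets[-1]))
    for i in range(K):
        for t in range(targets_per_cluster):
            row = i * targets_per_cluster + t
            # destination cluster of each link drawn according to row i of T
            dest = rng.choice(K, size=links_per_target, p=T[i])
            for j in dest:
                W[row, offsets[j] + rng.integers(P[j])] += 1
    return W
```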

Accuracy Study (Cont.)
- 5 synthetic dataset settings, 4 methods compared
- For each setting, 10 datasets are generated, and each method is run 100 times per dataset
- RankClus with authority ranking is the best overall

Efficiency Study
Setup: vary the number of attribute-type objects, doubling the size at each step.

Outline
- Background
- Motivation
- The RankClus Algorithm
- Experiments
- Conclusions

Conclusions
- A general framework is proposed in which ranking and clustering are combined to analyze information networks
- We formally study how ranking and clustering can mutually reinforce each other in information network analysis
- A novel algorithm, RankClus, is proposed, and its correctness and effectiveness are verified
- A thorough experimental study on both synthetic and real datasets, in comparison with state-of-the-art algorithms, demonstrates the accuracy and efficiency of RankClus