Data Mining: Principles and Algorithms. Introduction to Network Analysis. Jiawei Han, Department of Computer Science, University of Illinois at Urbana-Champaign, www.cs.uiuc.edu/~hanj. ©2012 Jiawei Han. All rights reserved. Acknowledgements: Based on the slides by Sangkyum Kim and Chen Chen.
Introduction to Network Analysis: Measure and Metrics of Networks; Mining Information Network
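Measure & Metrics: Degree Centrality. Eigenvector Centrality: not all neighbors are equal. Katz Centrality. PageRank rests on two assumptions: (1) in the Web graph, the larger a page node's in-degree, the more important the page; (2) the in-links pointing to a page A carry different weights, and higher-quality pages contribute more weight to A. HITS divides pages into Authority pages and Hub pages: an Authority page is a high-quality page, and a Hub page is one that points to Authority pages. (1) A good Authority page is referenced by many Hub pages; (2) a good Hub page points to many good Authority pages.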
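The PageRank assumptions and the two HITS rules above can be illustrated with a minimal sketch. It is not taken from the slides: the function names, the damping value 0.85, the even redistribution of rank from pages without out-links, and the sum-to-one normalization in HITS are all assumptions chosen for illustration. The graph is a dict mapping every page to the list of pages it links to.

```python
def in_degree(adj):
    """Count in-links per node; adj[u] lists the nodes that u links to."""
    indeg = {u: 0 for u in adj}
    for outs in adj.values():
        for v in outs:
            indeg[v] += 1
    return indeg


def pagerank(adj, damping=0.85, iters=100):
    """Power-iteration PageRank (damping value is an assumed default)."""
    nodes = list(adj)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        new = {u: (1.0 - damping) / n for u in nodes}
        for u in nodes:
            # A page splits its rank among its out-links, so an in-link
            # from a highly ranked page contributes more weight.
            targets = adj[u] if adj[u] else nodes  # no out-links: spread evenly
            share = damping * rank[u] / len(targets)
            for v in targets:
                new[v] += share
        rank = new
    return rank


def hits(adj, iters=100):
    """Iterate the two HITS rules; scores are normalized to sum to 1."""
    nodes = list(adj)
    hub = {u: 1.0 for u in nodes}
    for _ in range(iters):
        # Rule 1: a good authority is referenced by many good hubs.
        auth = {u: 0.0 for u in nodes}
        for u in nodes:
            for v in adj[u]:
                auth[v] += hub[u]
        total = sum(auth.values()) or 1.0
        auth = {u: a / total for u, a in auth.items()}
        # Rule 2: a good hub points to many good authorities.
        hub = {u: sum(auth[v] for v in adj[u]) for u in nodes}
        total = sum(hub.values()) or 1.0
        hub = {u: h / total for u, h in hub.items()}
    return hub, auth


# Hypothetical four-page Web graph.
web = {"a": ["c"], "b": ["c"], "c": ["d"], "d": []}
ranks = pagerank(web)
hubs, auths = hits(web)
```

In this toy graph, page c is referenced by two hubs and ends up with the highest authority score, while a and b receive the highest hub scores.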
Measure & Metrics (2): Eigenvector Centrality
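The slide's formula did not survive extraction. In the standard definition, the centrality x_i of node i is proportional to the sum of its neighbors' centralities, x_i = (1/λ) Σ_j A_ij x_j, where A is the adjacency matrix and λ its largest eigenvalue. The sketch below approximates this by power iteration on an undirected graph. The function name and parameters are assumptions, and the iteration multiplies by A + I rather than A, a common trick that leaves the leading eigenvector unchanged but avoids oscillation on bipartite graphs.

```python
import math


def eigenvector_centrality(adj, iters=200, tol=1e-10):
    """Power iteration on A + I; adj[u] lists the neighbors of u."""
    x = {u: 1.0 for u in adj}
    for _ in range(iters):
        # Each node collects its neighbors' scores (plus its own, for A + I).
        new = {u: x[u] + sum(x[v] for v in adj[u]) for u in adj}
        norm = math.sqrt(sum(val * val for val in new.values()))
        new = {u: val / norm for u, val in new.items()}
        if max(abs(new[u] - x[u]) for u in adj) < tol:
            return new
        x = new
    return x


# Hypothetical graph: triangle a-b-c with a pendant node d attached to c.
graph = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
scores = eigenvector_centrality(graph)
```

Here d has one neighbor and a has two, yet the gap between them is larger than degree alone would suggest, because d's only neighbor is shared with everyone else: not all neighbors are equal.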
Measure & Metrics (3): Katz Centrality
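The slide's formula is missing from the extracted text. The usual definition of Katz centrality gives every node a small baseline score β on top of the contribution of its in-neighbors: x_i = α Σ_j A_ij x_j + β, with α smaller than the reciprocal of the largest eigenvalue of A so that the iteration converges. A minimal sketch follows; the function name, the default values of α and β, and the in-link dict representation are assumptions.

```python
def katz_centrality(in_links, alpha=0.1, beta=1.0, iters=1000, tol=1e-12):
    """Fixed-point iteration; in_links[u] lists the nodes that point to u."""
    x = {u: 0.0 for u in in_links}
    for _ in range(iters):
        # Every node receives beta, plus alpha times its in-neighbors' scores.
        new = {u: alpha * sum(x[v] for v in in_links[u]) + beta
               for u in in_links}
        if max(abs(new[u] - x[u]) for u in in_links) < tol:
            return new
        x = new
    return x


# Hypothetical directed chain a -> b -> c.
chain = {"a": [], "b": ["a"], "c": ["b"]}
katz = katz_centrality(chain)
```

Node a has no in-links, so eigenvector centrality would assign it zero and pass that zero down the chain; with the baseline β, a keeps a nonzero score and b and c build on it.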
What Are Information Networks? Information network: a network where each node represents an entity (e.g., an actor in a social network) and each link (e.g., a tie) represents a relationship between entities. Each node or link may have attributes, labels, and weights. A link may carry rich semantic information. Homogeneous vs. heterogeneous networks. Homogeneous networks: a single object type and a single link type; single model social networks (e.g., friends); the WWW, a collection of linked Web pages. Heterogeneous, multi-typed networks: multiple object and link types; medical network: patients, doctors, diseases, contacts, treatments; bibliographic network: publications, authors, venues.
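One way to make the homogeneous/heterogeneous distinction concrete is a small container that stores a type for every node and every link. The class below is a hypothetical sketch for illustration only; its name, methods, and the node and link type labels in the example are assumptions, not part of the slides.

```python
from collections import defaultdict


class InformationNetwork:
    """Typed nodes and typed, weighted links (hypothetical container)."""

    def __init__(self):
        self.node_type = {}              # node -> object type
        self.links = defaultdict(float)  # (source, link type, target) -> weight

    def add_node(self, node, node_type):
        self.node_type[node] = node_type

    def add_link(self, source, link_type, target, weight=1.0):
        self.links[(source, link_type, target)] += weight

    def object_types(self):
        return set(self.node_type.values())

    def link_types(self):
        return {link_type for _, link_type, _ in self.links}

    def is_homogeneous(self):
        # Homogeneous: a single object type and a single link type.
        return len(self.object_types()) <= 1 and len(self.link_types()) <= 1


# A friendship network has one object type and one link type.
friends = InformationNetwork()
friends.add_node("ann", "person")
friends.add_node("bob", "person")
friends.add_link("ann", "friend", "bob")

# A bibliographic network mixes publications, authors, and venues.
bib = InformationNetwork()
bib.add_node("paper1", "publication")
bib.add_node("ann", "author")
bib.add_node("EDBT", "venue")
bib.add_link("ann", "writes", "paper1")
bib.add_link("paper1", "published_in", "EDBT")
```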
Clustering and Ranking: Two Critical Functions. Clustering alone: not distinguishing objects in each cluster? Ranking alone: comparing apples and oranges? (Yelp: department store vs. restaurant.) A better solution: integrating clustering with ranking. [Figure: objects A to J shown as clusters, as one global ranking from 1 to 10, and as rankings from 1 to 5 within each cluster.]
RankClus: Integrating Clustering with Ranking. Simple solution: project the bi-typed network into a homogeneous conference network? Such a projection loses information!
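A small sketch shows why projection loses information. It assumes one particular projection, in which two conferences are linked with a weight equal to the number of authors they share; the function name and the data are hypothetical. Two different conference-author networks then collapse to the same conference network.

```python
from collections import defaultdict
from itertools import combinations


def project_to_conferences(pub_counts):
    """Link two conferences by the number of authors they share.

    pub_counts maps (conference, author) to a number of papers.
    """
    venues_of = defaultdict(set)
    for (conf, author), n in pub_counts.items():
        if n > 0:
            venues_of[author].add(conf)
    weight = defaultdict(int)
    for confs in venues_of.values():
        for a, b in combinations(sorted(confs), 2):
            weight[(a, b)] += 1
    return dict(weight)


# Two different bi-typed networks (hypothetical publication counts).
net1 = {("SIGMOD", "alice"): 10, ("VLDB", "alice"): 1}
net2 = {("SIGMOD", "alice"): 1, ("VLDB", "alice"): 10}

# Both project to the same conference network: who published how much,
# and where, is no longer recoverable.
same = project_to_conferences(net1) == project_to_conferences(net2)
```

Other projections (for example, weighting by shared paper counts) keep more detail, but any projection onto a single object type discards the author side of the network.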
A New Methodology: RankClus. Ranking as the feature of the cluster: ranking is conditional on a specific cluster (e.g., VLDB’s rank in Theory vs. its rank in the DB area), and the distributions of ranking scores over objects are different in each cluster. Clustering and ranking are mutually enhanced. Better clustering: the rank distributions of the clusters become more distinguishable from each other. Better ranking: a better metric for objects is learned from the ranking. Not every object should be treated equally in clustering! Y. Sun, et al., “RankClus: Integrating Clustering with Ranking for Heterogeneous Information Network Analysis”, EDBT'09.
Simple Ranking vs. Authority Ranking. Simple Ranking: proportional to the number of publications of an author or a conference; considers only the immediate neighborhood in the network. Authority Ranking: more sophisticated “rank rules” are needed; propagate the ranking scores in the network over different types. What about an author publishing 100 papers in very weak conferences?
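Simple ranking can be sketched in a few lines, which also makes its weakness visible. The function name and the publication counts below are hypothetical; scores are normalized to sum to 1.

```python
from collections import defaultdict


def simple_ranking(pub_counts):
    """Rank conferences and authors by their share of all papers.

    pub_counts maps (conference, author) to a number of papers.
    """
    conf_total = defaultdict(float)
    author_total = defaultdict(float)
    for (conf, author), n in pub_counts.items():
        conf_total[conf] += n
        author_total[author] += n
    total = sum(conf_total.values()) or 1.0
    conf_rank = {c: v / total for c, v in conf_total.items()}
    author_rank = {a: v / total for a, v in author_total.items()}
    return conf_rank, author_rank


# Hypothetical data: one author with 100 papers in a weak venue.
pubs = {("WeakConf", "prolific"): 100, ("TopConf", "expert"): 5}
conf_rank, author_rank = simple_ranking(pubs)
```

Because only paper counts matter, the prolific author outranks the expert regardless of venue quality, which is the problem the slide's closing question points at.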
Rules for Authority Ranking. Rule 1: Highly ranked authors publish many papers in highly ranked conferences. Rule 2: Highly ranked conferences attract many papers from many highly ranked authors. Rule 3: The rank of an author is enhanced if he or she co-authors with many highly ranked authors.
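The three rules can be read as an iterative score propagation between conferences and authors. The following is a minimal sketch under stated assumptions, not necessarily the exact formulation in the RankClus paper: the function name, the mixing weight alpha between Rule 1 and Rule 3, the sum-to-one normalization, and the sample data are all assumptions. Each co-author pair is listed once, and every author in a pair must also appear in the publication counts.

```python
def authority_ranking(pub_counts, coauthor=None, alpha=1.0, iters=100):
    """Propagate rank scores between conferences and authors.

    pub_counts maps (conference, author) to a number of papers;
    coauthor maps an unordered (author, author) pair to a joint paper count.
    """
    confs = sorted({c for c, _ in pub_counts})
    authors = sorted({a for _, a in pub_counts})
    coauthor = coauthor or {}
    conf_rank = {c: 1.0 / len(confs) for c in confs}
    author_rank = {a: 1.0 / len(authors) for a in authors}

    def normalize(scores):
        total = sum(scores.values()) or 1.0
        return {key: val / total for key, val in scores.items()}

    for _ in range(iters):
        new_author = {a: 0.0 for a in authors}
        # Rule 1: authors gain rank from papers in highly ranked conferences.
        for (c, a), n in pub_counts.items():
            new_author[a] += alpha * n * conf_rank[c]
        # Rule 3: authors gain rank from highly ranked co-authors.
        for (a, b), n in coauthor.items():
            new_author[a] += (1.0 - alpha) * n * author_rank[b]
            new_author[b] += (1.0 - alpha) * n * author_rank[a]
        author_rank = normalize(new_author)
        # Rule 2: conferences gain rank from papers by highly ranked authors.
        new_conf = {c: 0.0 for c in confs}
        for (c, a), n in pub_counts.items():
            new_conf[c] += n * author_rank[a]
        conf_rank = normalize(new_conf)
    return conf_rank, author_rank


# Hypothetical data: dan has the most papers, almost all in WeakConf.
pubs = {
    ("TopConf", "ann"): 3, ("TopConf", "bob"): 3, ("TopConf", "cat"): 3,
    ("WeakConf", "ann"): 1, ("WeakConf", "dan"): 4,
}
conf_rank, author_rank = authority_ranking(pubs)
```

In this example dan has more papers than bob, yet ranks below bob, because dan's papers sit in the conference that the propagation scores lower. With alpha below 1 and a `coauthor` dict, Rule 3 also contributes.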
RankClus: Algorithm Framework. Initialization: randomly partition. Repeat: (1) Ranking: rank the objects in each sub-network induced from each cluster. (2) Generating new measure space: estimate the mixture model coefficients for each target object. (3) Adjusting clusters. Until stable. [Diagram labels: Sub-Network, Ranking, Clustering.]
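The loop structure of the framework can be sketched as follows. This is a heavily simplified illustration, not the authors' implementation: it uses simple ranking inside each cluster instead of authority ranking, replaces the mixture-model estimation with a single posterior pass, reassigns conferences by cosine similarity to cluster centers, and does not restart when a cluster becomes empty. All names, the smoothing constant, and the sample data are assumptions.

```python
import math
import random
from collections import defaultdict


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def rankclus_sketch(pub_counts, k, init=None, max_iters=20, seed=0):
    """Cluster conferences; pub_counts maps (conference, author) to papers."""
    confs = sorted({c for c, _ in pub_counts})
    authors = sorted({a for _, a in pub_counts})
    by_conf = defaultdict(list)
    for (c, a), n in pub_counts.items():
        by_conf[c].append((a, n))
    rng = random.Random(seed)
    # Initialization: random partition unless a starting one is supplied.
    assign = dict(init) if init else {c: rng.randrange(k) for c in confs}
    eps = 1e-3  # smoothing so unseen authors keep a small probability
    rank = []
    for _ in range(max_iters):
        # Ranking: simple ranking of authors within each cluster's sub-network.
        mass = [defaultdict(float) for _ in range(k)]
        for (c, a), n in pub_counts.items():
            mass[assign[c]][a] += n
        rank = []
        for z in range(k):
            total = sum(mass[z].values()) + eps * len(authors)
            rank.append({a: (mass[z][a] + eps) / total for a in authors})
        prior = [sum(1 for c in confs if assign[c] == z) / len(confs)
                 for z in range(k)]
        # New measure space: one k-dimensional coefficient vector per conference.
        pi = {}
        for c in confs:
            coeff = [0.0] * k
            papers = sum(n for _, n in by_conf[c])
            for a, n in by_conf[c]:
                post = [prior[z] * rank[z][a] for z in range(k)]
                s = sum(post) or 1.0
                for z in range(k):
                    coeff[z] += n * post[z] / s
            pi[c] = [v / papers for v in coeff]
        # Adjusting clusters: move each conference to the closest center.
        centers = []
        for z in range(k):
            members = [pi[c] for c in confs if assign[c] == z]
            centers.append([sum(col) / len(members) for col in zip(*members)]
                           if members else [0.0] * k)
        new_assign = {c: max(range(k), key=lambda z: cosine(pi[c], centers[z]))
                      for c in confs}
        if new_assign == assign:  # until stable
            break
        assign = new_assign
    return assign, rank


# Hypothetical data: two database venues and two machine-learning venues.
pubs = {(c, a): 5 for c in ("DB1", "DB2") for a in ("d1", "d2")}
pubs.update({(c, a): 5 for c in ("ML1", "ML2") for a in ("m1", "m2")})
start = {"DB1": 0, "DB2": 0, "ML1": 0, "ML2": 1}  # ML1 starts misplaced
clusters, cluster_ranks = rankclus_sketch(pubs, k=2, init=start)
```

Starting from a partition with one misplaced conference, the within-cluster rankings pull ML1 toward the cluster whose author distribution it matches, and the loop stops once the assignment no longer changes.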
RankClus: Clustering & Ranking CS Conferences. Top-10 conferences in 5 clusters using RankClus in DBLP (when k = 15). RankClus outperforms spectral clustering algorithms [Shi and Malik, 2000] on projected homogeneous networks.