A General Model for Relational Clustering
Bo Long and Zhongfei (Mark) Zhang, Computer Science Dept./Watson School, SUNY Binghamton
Xiaoyun Wu, Yahoo! Inc.
Philip S. Yu, IBM Watson Research Center

Multi-type Relational Data (MTRD) is Everywhere!
- Bibliometrics: papers, authors, journals
- Social networks: people, institutions, friendship links
- Biological data: genes, proteins, conditions
- Corporate databases: customers, products, suppliers, shareholders
[Figure labels: Papers, Authors, Key words]

Challenges for Clustering!
- Data objects are not identically distributed: heterogeneous data objects (papers, authors).
- Data objects are not independent: heterogeneous data objects are related to each other.
- No IID assumption.

Relational Data -> Flat Data?
[Tables: one flattened table per object type. Paper ID rows with columns word1, word2, ..., author1, author2, ...; Author ID rows with columns Paper 1, Paper 2, ...; Word ID rows with columns Paper 1, Paper 2, ...]
- High-dimensional and sparse data
- Data redundancy
[Figure labels: Papers, Authors, Key words]

Relational Data -> Flat Data?
- No interactions of hidden structures of different types of data objects.
- Difficult to discover the global community structure.
[Figure labels: users, Web pages, queries]

A General Model: Collective Factorization on Related Matrices
- Formulate multi-type relational data as a set of related matrices.
- Cluster different types of objects simultaneously by factorizing the related matrices simultaneously.
- Make use of the interaction of hidden structures of different types of objects.

Data Representation
- Represent an MTRD set as a set of related matrices:
  - Relation matrix $R^{(ij)}$ denotes the relations between the $i$th type of objects and the $j$th type of objects.
  - Feature matrix $F^{(i)}$ denotes the feature values for the $i$th type of objects.
[Figure labels: Users, Movies, Words, Authors, Papers, f, $R^{(12)}$, $R^{(23)}$, $F^{(1)}$]
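The representation above can be sketched in code. This is a minimal illustration, assuming NumPy and a made-up toy data set; the names `sizes`, `R`, `F`, and `validate_mtrd` are hypothetical and do not come from the slides. It stores one relation matrix per pair of related types and one feature matrix per type, mirroring the $R^{(12)}$, $R^{(23)}$, $F^{(1)}$ labels in the figure.

```python
import numpy as np

# Hypothetical toy MTRD set with three object types (sizes are made up).
sizes = {1: 4, 2: 3, 3: 5}

# Relation matrices: R[(i, j)] has one row per type-i object and one
# column per type-j object.
R = {
    (1, 2): np.array([[1, 0, 0],
                      [1, 1, 0],
                      [0, 1, 1],
                      [0, 0, 1]], dtype=float),
    (2, 3): np.array([[2, 1, 0, 0, 0],
                      [0, 1, 3, 0, 0],
                      [0, 0, 0, 1, 2]], dtype=float),
}

# Feature matrices: F[i] has one row per type-i object and one column
# per feature.
F = {
    1: np.array([[0.5, 1.0],
                 [0.4, 0.9],
                 [2.0, 0.1],
                 [2.2, 0.0]]),
}


def validate_mtrd(sizes, R, F):
    """Check that every matrix agrees with the declared object counts."""
    for (i, j), Rij in R.items():
        if Rij.shape != (sizes[i], sizes[j]):
            raise ValueError(f"R({i}{j}) has shape {Rij.shape}")
    for i, Fi in F.items():
        if Fi.shape[0] != sizes[i]:
            raise ValueError(f"F({i}) has {Fi.shape[0]} rows")
    return True


validate_mtrd(sizes, R, F)
```

Keeping the matrices separate, rather than flattening them into one wide table, preserves which object types each relation connects.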

Matrix Factorization
- Exploring the hidden structure of the data matrix by its factorization.
[Equation labels: feature basis matrix, cluster association matrix]

Model: Collective Factorization on Related Matrices (CFRM)
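The model equation on this slide was an image and is missing from the transcript. A sketch of what the objective plausibly looks like, using the slides' symbols $R^{(ij)}$, $F^{(i)}$, $C^{(p)}$, and $A$, and assuming $m$ object types, weights $w_a^{(ij)}$ and $w_b^{(i)}$, feature basis matrices $B^{(i)}$, and a Frobenius norm (none of which are stated in the surviving text):

```latex
\min_{\{C^{(p)}\},\,\{A^{(ij)}\},\,\{B^{(i)}\}} \;
\sum_{1 \le i < j \le m} w_a^{(ij)}
  \left\| R^{(ij)} - C^{(i)} A^{(ij)} \bigl(C^{(j)}\bigr)^{T} \right\|^{2}
+ \sum_{1 \le i \le m} w_b^{(i)}
  \left\| F^{(i)} - C^{(i)} B^{(i)} \right\|^{2}
\quad \text{s.t.} \quad \bigl(C^{(p)}\bigr)^{T} C^{(p)} = I_{k_p}, \; 1 \le p \le m
```

Under this reading, $C^{(p)}$ is the cluster indicator matrix for the $p$th object type, $A^{(ij)}$ is the cluster association matrix between types $i$ and $j$, and $B^{(i)}$ is the feature basis matrix for type $i$.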

CFRM Model: Example
[Figure labels: object types 1, 2, 3 and feature f]

Spectral Clustering
- Algorithms that cluster points using eigenvectors of matrices derived from the data.
- Obtain a data representation in a low-dimensional space that can be easily clustered.
- Traditional spectral clustering focuses on homogeneous data.

Main Theorem:
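The theorem statement itself did not survive extraction. A hedged reconstruction, assuming a least-squares factorization objective with weights $w_a^{(ij)}$, $w_b^{(i)}$ and orthonormal $C^{(p)}$ (symbols beyond $R$, $F$, $C$ are assumptions): substituting the optimal $A^{(ij)} = (C^{(i)})^{T} R^{(ij)} C^{(j)}$ and $B^{(i)} = (C^{(i)})^{T} F^{(i)}$ turns the minimization into a trace maximization of roughly this form:

```latex
\max_{(C^{(p)})^{T} C^{(p)} = I_{k_p}} \;
\sum_{1 \le i < j \le m} w_a^{(ij)} \,
  \operatorname{tr}\!\Bigl( \bigl(C^{(i)}\bigr)^{T} R^{(ij)} C^{(j)}
  \bigl(C^{(j)}\bigr)^{T} \bigl(R^{(ij)}\bigr)^{T} C^{(i)} \Bigr)
+ \sum_{1 \le i \le m} w_b^{(i)} \,
  \operatorname{tr}\!\Bigl( \bigl(C^{(i)}\bigr)^{T} F^{(i)}
  \bigl(F^{(i)}\bigr)^{T} C^{(i)} \Bigr)
```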

Algorithm Derivation: Iterative Updating where,
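The update equations on this slide are missing from the transcript. One plausible form, assuming the trace-maximization view with weights $w_a^{(ij)}$, $w_b^{(p)}$ (assumed symbols), is to fix every $C^{(j)}$ with $j \neq p$ and solve for $C^{(p)}$ alone:

```latex
\max_{(C^{(p)})^{T} C^{(p)} = I_{k_p}} \;
\operatorname{tr}\!\Bigl( \bigl(C^{(p)}\bigr)^{T} M^{(p)} C^{(p)} \Bigr),
\qquad
M^{(p)} = w_b^{(p)} F^{(p)} \bigl(F^{(p)}\bigr)^{T}
+ \sum_{p < j \le m} w_a^{(pj)} R^{(pj)} C^{(j)} \bigl(C^{(j)}\bigr)^{T} \bigl(R^{(pj)}\bigr)^{T}
+ \sum_{1 \le j < p} w_a^{(jp)} \bigl(R^{(jp)}\bigr)^{T} C^{(j)} \bigl(C^{(j)}\bigr)^{T} R^{(jp)}
```

Terms for a missing relation or feature matrix are simply dropped.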

Spectral Relaxation
- Apply real relaxation to $C^{(p)}$, letting it be an arbitrary orthonormal matrix.
- By the Ky Fan theorem, the optimal solution is given by the leading $k_p$ eigenvectors of $M^{(p)}$.
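The relaxed step can be illustrated numerically. This is a minimal sketch, assuming NumPy and a random symmetric positive semidefinite stand-in for $M^{(p)}$; the helper name `leading_eigenvectors` is hypothetical. It checks the Ky Fan statement: the trace $\operatorname{tr}(C^{T} M C)$ over orthonormal $C$ is maximized by the leading eigenvectors, where it equals the sum of the $k$ largest eigenvalues.

```python
import numpy as np


def leading_eigenvectors(M, k):
    """Return an n x k matrix whose columns are orthonormal eigenvectors
    of the symmetric matrix M for its k largest eigenvalues."""
    M = (M + M.T) / 2.0             # guard against round-off asymmetry
    _, vecs = np.linalg.eigh(M)     # eigh returns eigenvalues in ascending order
    return vecs[:, ::-1][:, :k]     # reverse columns, keep the first k


rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))
M = X @ X.T                         # toy symmetric PSD matrix standing in for M^(p)
k = 2

C = leading_eigenvectors(M, k)
best_trace = np.trace(C.T @ M @ C)
top_k_sum = np.sort(np.linalg.eigvalsh(M))[::-1][:k].sum()

# Any other orthonormal matrix should give a trace that is no larger.
Q, _ = np.linalg.qr(rng.standard_normal((6, k)))
other_trace = np.trace(Q.T @ M @ Q)
```

Here `best_trace` matches `top_k_sum` up to round-off, and `other_trace` does not exceed it.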

Spectral Relational Clustering (SRC)

Spectral Relational Clustering: Example
- Update $C^{(1)}$ as the $k_1$ leading eigenvectors of $M^{(1)}$.
- Update $C^{(2)}$ as the $k_2$ leading eigenvectors of $M^{(2)}$.
- Update $C^{(3)}$ as the $k_3$ leading eigenvectors of $M^{(3)}$.
[Figure labels: object types 1, 2, 3]
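The three updates above can be sketched as a loop. This is an illustrative sketch under stated assumptions, not the authors' implementation: it assumes NumPy, unit weights by default, random toy data, and a form of $M^{(p)}$ built from every relation and feature matrix touching type $p$. All function and variable names are hypothetical. It stops at the relaxed embeddings and does not convert them into discrete cluster labels.

```python
import numpy as np


def leading_eigenvectors(M, k):
    """Orthonormal eigenvectors of symmetric M for its k largest eigenvalues."""
    _, vecs = np.linalg.eigh((M + M.T) / 2.0)
    return vecs[:, ::-1][:, :k]


def build_M(p, sizes, R, F, C, w_a, w_b):
    """Assumed form of M^(p): contributions from all matrices involving type p."""
    M = np.zeros((sizes[p], sizes[p]))
    for (i, j), Rij in R.items():
        if i == p:
            P = Rij @ C[j]          # type p indexes the rows of R(ij)
        elif j == p:
            P = Rij.T @ C[i]        # type p indexes the columns of R(ij)
        else:
            continue
        M += w_a.get((i, j), 1.0) * (P @ P.T)
    if p in F:
        M += w_b.get(p, 1.0) * (F[p] @ F[p].T)
    return M


def objective(R, F, C, w_a, w_b):
    """Trace objective that the block updates are meant to increase."""
    total = 0.0
    for (i, j), Rij in R.items():
        total += w_a.get((i, j), 1.0) * np.sum((C[i].T @ Rij @ C[j]) ** 2)
    for i, Fi in F.items():
        total += w_b.get(i, 1.0) * np.sum((C[i].T @ Fi) ** 2)
    return total


def src_embeddings(sizes, k, R, F, w_a=None, w_b=None, n_iter=10, seed=0):
    w_a, w_b = w_a or {}, w_b or {}
    rng = np.random.default_rng(seed)
    # Random orthonormal initialization for each object type.
    C = {p: np.linalg.qr(rng.standard_normal((sizes[p], k[p])))[0] for p in sizes}
    history = [objective(R, F, C, w_a, w_b)]
    for _ in range(n_iter):
        for p in sizes:             # update one type at a time, others fixed
            M = build_M(p, sizes, R, F, C, w_a, w_b)
            C[p] = leading_eigenvectors(M, k[p])
        history.append(objective(R, F, C, w_a, w_b))
    return C, history


# Toy tri-type data with made-up sizes and random nonnegative entries.
rng = np.random.default_rng(42)
sizes = {1: 6, 2: 8, 3: 5}
k = {1: 2, 2: 3, 3: 2}
R = {(1, 2): rng.random((6, 8)), (2, 3): rng.random((8, 5))}
F = {1: rng.random((6, 4))}

C, history = src_embeddings(sizes, k, R, F)
```

Because each update maximizes the objective over one $C^{(p)}$ with the others fixed, the values in `history` should never decrease.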

Advantages of Spectral Relational Clustering (SRC)
- As simple as traditional spectral approaches.
- Applicable to relational data with various structures.
- Adaptive low-dimensional embedding.
- Efficient: $O(tmn^2k)$. For sparse data, this reduces to $O(tmzk)$, where $z$ denotes the number of non-zero elements.

Special case 1: k-means and spectral clustering
- Flat data: a special MTRD with only one feature matrix $F$.
- By the main theorem, k-means is equivalent to a trace maximization.
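The equivalence can be checked numerically. This is a minimal sketch, assuming NumPy, toy random data, and a column-normalized cluster indicator matrix $C$ (so that $C^{T} C = I$); the names are hypothetical. For any fixed assignment, the k-means sum of squared errors equals $\operatorname{tr}(F^{T} F) - \operatorname{tr}(C^{T} F F^{T} C)$, so minimizing the error amounts to maximizing the trace term.

```python
import numpy as np


def normalized_indicator(labels, k):
    """n x k indicator matrix with column j scaled by 1/sqrt(size of cluster j)."""
    n = len(labels)
    C = np.zeros((n, k))
    C[np.arange(n), labels] = 1.0
    return C / np.sqrt(C.sum(axis=0))   # assumes no cluster is empty


rng = np.random.default_rng(1)
F = rng.standard_normal((10, 3))                     # toy feature matrix
labels = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 2])    # arbitrary assignment
C = normalized_indicator(labels, 3)

# Classical k-means objective: squared distance of each row to its centroid.
centroids = np.vstack([F[labels == j].mean(axis=0) for j in range(3)])
sse = np.sum((F - centroids[labels]) ** 2)

# Trace form of the same quantity.
trace_term = np.trace(C.T @ F @ F.T @ C)
total = np.trace(F.T @ F)
```

The identity holds because $C C^{T} F$ replaces each row of $F$ with its cluster centroid.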

Special case 2: Bipartite Spectral Graph Partitioning (BSGP)
- Bipartite graph: a special case of MTRD with one relation matrix $R$.
- BSGP restricts the clusters of different types of objects to have one-to-one associations, i.e., diagonal constraints on $A$.

Experiments
- Bi-type relational data: document-word data.
- Tri-type relational data: category-document-word data.
- Algorithms compared: Normalized Cut (NC), Bipartite Spectral Graph Partitioning (BSGP), Mutual Reinforcement K-means (MRK), and Consistent Bipartite Graph Co-partitioning (CBGC).

Experimental Results on Bi-type Relational Data

Eigenvectors of a multi2 data set

Experimental Results on Tri-type Relational Data

Summary
- Collective Factorization on Related Matrices: a general model for MTRD clustering.
- Spectral Relational Clustering: a novel spectral approach.
  - Simple and applicable to relational data with various structures.
  - Adaptive low-dimensional embedding.
  - Efficient.
- Theoretical analysis and experiments demonstrate the effectiveness and promise of the model and the algorithm.
