Spectral Clustering
Jianping Fan, Dept of CS, UNC-Charlotte

Key issues for Data Clustering
Similarity or distance function
Inter-cluster similarity or distance
Intra-cluster similarity or distance
Number of clusters K
Decision for data clustering
Objective function: inter-cluster distances are maximized; intra-cluster distances are minimized

Summary of K-means
Problems of K-means:
Locations of centers
Number of clusters K
Sensitive to outliers
Data manifolds (shapes of data distributions)
Experiences:
Centers: random initialization and density scan
K: start from a small K and split iteratively, or start from a large K and merge sequentially
Outliers:

Problems of K-means
Objective function: inter-cluster distances are maximized; intra-cluster distances are minimized
Distance function: geometric distance
Assignment step: assign each point to its nearest cluster center
Optimization step: recompute each center as the mean of the points assigned to it
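The alternating assignment and optimization steps can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's implementation: it assumes Euclidean distance and caller-supplied initial centers, and the function name `kmeans` and the sample points are hypothetical.

```python
import math

def kmeans(points, centers, iterations=20):
    """Plain Lloyd iterations: alternate the assignment and optimization steps."""
    centers = [tuple(c) for c in centers]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest center.
        labels = [
            min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        # Optimization step: move every center to the mean of its points.
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                mean = tuple(sum(coord) / len(members) for coord in zip(*members))
            else:
                mean = old  # keep an empty cluster's center where it was
            new_centers.append(mean)
        if new_centers == centers:
            break  # converged: the centers no longer move
        centers = new_centers
    return labels, centers

# Two well-separated blobs (made-up data); the initial centers are an assumption.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
labels, centers = kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0)])
```

Because every point is attached to the nearest center in straight-line distance, the result depends on the initial centers and favors compact, roughly spherical groups.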

Problems of K-means
The similarity function cannot handle special data manifolds effectively!
Intra-cluster similarity and inter-cluster similarity are not optimized jointly or simultaneously!
Pre-selected locations of cluster centers may not be acceptable!

K-Means Clustering
[Figure: expected vs. achieved clustering]
Why does K-means fail?

Why K-Means Clustering Fails?
[Figure: expected vs. achieved clustering]
Similarity or distance function
Inter-cluster similarity or distance
Intra-cluster similarity or distance
Number of clusters K
Decision for data clustering
Objective function

[Figure: achieved vs. expected clustering]
Number of clusters K may not be an issue here
Objective function?

[Figure: expected vs. achieved clustering]
Data manifold: relationship rather than distance
Distance function & decision for data clustering

Key issues for Data Clustering
Inter-cluster similarity or distance
Intra-cluster similarity or distance
Number of clusters K
Decision for data clustering
Similarity or distance function

Lecture Outline
Motivation
Graph overview and construction
Spectral clustering
Cool implementations

Spectral Clustering Example – 2 Spirals
The dataset exhibits complex cluster shapes.
K-means performs very poorly in this space due to its bias toward dense spherical clusters.
Relationship vs. geometric distance
In the embedded space given by the two leading eigenvectors, the clusters are trivial to separate.

Spectral Clustering
Relationship
Similarity representation
Inter-cluster similarity
Intra-cluster similarity
Number of clusters K
Decision for clustering
Objective function

Graph-Based Similarity Representation: considering the data manifold
Geometric distance vs. relationship

Spectral Clustering Example
Why does K-means fail? Geometry vs. manifold

Graph-Based Similarity Representation
Distance vs. Relationship

Graph-Based Similarity Representation
Number of clusters matters

Spectral Clustering Cool implementation

Graph-based Representation of Data Similarity (Relationship)

Graph-based Representation of Data Relationship

Manifold (Shape of Data Distribution)

Graph-based Representation of Data Relationships
Manifold

How to generate such graph for data relationship representation?

Data Graph Construction

Graph Cut

Spectral Clustering: considering intra-cluster similarity and inter-cluster similarity jointly!
Cool implementations

Key issues for Spectral Clustering
Relationship function for graph construction
Inter-cluster similarity or distance
Intra-cluster similarity or distance
Number of clusters K
Decision for data clustering
Objective function

How to Do Graph Partitioning?
Citation Group Identification

Social Group Identification

Hot Topic Detection

Intra-cluster similarity

Spectral Clustering
[Figure: graph cut]
Intra-cluster similarity:
Inter-cluster similarity:
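One common way to write these two quantities is sketched below. The notation is an assumption rather than the slide's own: $w_{ij}$ is the edge weight between nodes $i$ and $j$, the graph is split into two node sets $A$ and $B$, and $s(\cdot,\cdot)$ is a hypothetical name for the summed similarity.

```latex
% Intra-cluster similarity of A: total weight of edges inside A
s(A, A) = \sum_{i \in A} \sum_{j \in A} w_{ij}

% Inter-cluster similarity: total weight of edges running between A and B
s(A, B) = \sum_{i \in A} \sum_{j \in B} w_{ij}
```

Under this notation a good partition has large $s(A, A)$ and $s(B, B)$ and small $s(A, B)$.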

Spectral Clustering
Graph cut
Objective function for spectral clustering:
1. Maximize intra-cluster similarity
2. Minimize inter-cluster similarity

Objective Function for Spectral Clustering
Graph cut
Objective function for spectral clustering: Min

Spectral Clustering: Graph cut
Clustering via a graph cut at weakly connected points: minimize inter-cluster similarity

Inter-cluster similarity

Graph Cut

Eigenvectors & Eigenvalues

Normalized Cut
A graph G(V, E) can be partitioned into two disjoint sets A and B.
The cut is defined as: $cut(A, B) = \sum_{u \in A,\, v \in B} w(u, v)$
The optimal partition of the graph G is achieved by minimizing the cut: $\min_{A, B} cut(A, B)$

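Normalized Cut
Normalized cut: $Ncut(A, B) = \frac{cut(A, B)}{assoc(A, V)} + \frac{cut(A, B)}{assoc(B, V)}$
Association between a partition set and the whole graph: $assoc(A, V) = \sum_{u \in A,\, t \in V} w(u, t)$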
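To make the normalized cut concrete, the sketch below evaluates it on a tiny weighted graph. It follows the usual definitions: the cut is the total weight of edges crossing between A and B, the association of a set is the total weight from its nodes to all nodes, and the normalized cut is the cut divided by each side's association, summed. The weight matrix `W` and the helper names are hypothetical, chosen only for illustration.

```python
def cut(W, A, B):
    """Total weight of the edges running between vertex sets A and B."""
    return sum(W[u][v] for u in A for v in B)

def assoc(W, A):
    """Total weight from the vertices in A to every vertex in the graph."""
    return sum(W[u][t] for u in A for t in range(len(W)))

def ncut(W, A, B):
    """Normalized cut of the two-way partition (A, B)."""
    c = cut(W, A, B)
    return c / assoc(W, A) + c / assoc(W, B)

# Hypothetical 4-vertex graph: two tight pairs joined by one weak edge.
W = [
    [0.0, 1.0, 0.1, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.1, 0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
]
balanced = ncut(W, [0, 1], [2, 3])   # cuts only the weak edge
lopsided = ncut(W, [1], [0, 2, 3])   # isolates a single vertex
```

The partition that severs only the weak edge scores far lower than the one that isolates a single vertex, which is the behavior the normalization is meant to reward.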

Normalized Cut

Normalized Cut
The normalized cut becomes:
The normalized cut can be solved by an eigenvalue equation:
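In the standard formulation by Shi and Malik, which these slides appear to follow, the two expressions take the form below. The symbols are assumptions on my part: $W$ is the symmetric weight matrix, $D$ is the diagonal matrix with $d_i = \sum_j w_{ij}$, $y$ is an indicator vector for the partition, and $b$ is a constant determined by the partition sizes.

```latex
% Normalized cut rewritten as a Rayleigh quotient over the indicator vector y
\min_{A, B} Ncut(A, B) = \min_{y} \frac{y^{T} (D - W)\, y}{y^{T} D\, y},
\qquad y_i \in \{1, -b\}, \quad y^{T} D \mathbf{1} = 0

% Relaxing y to real values gives a generalized eigenvalue problem
(D - W)\, y = \lambda D\, y
```

After the relaxation, the eigenvector belonging to the second smallest eigenvalue is used as the approximate partition indicator, and thresholding its entries splits the graph in two.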

Extending Binary Normalized Cut to Multi-Class

K-way Min-Max Cut
Intra-cluster similarity
Inter-cluster similarity
Decision function for spectral clustering: minimize inter-cluster similarity while maximizing intra-cluster similarity
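One common form of the K-way min-max cut objective sums, over all clusters, the ratio of inter-cluster similarity to intra-cluster similarity. The slide's exact formula is not reproduced here, so the form below is an assumption, and the function names and toy graph are hypothetical.

```python
import math

def similarity(W, P, Q):
    """Sum of edge weights between vertex sets P and Q."""
    return sum(W[i][j] for i in P for j in Q)

def min_max_cut(W, clusters):
    """Sum over clusters of (inter-cluster similarity / intra-cluster similarity)."""
    vertices = set(range(len(W)))
    total = 0.0
    for C in clusters:
        rest = vertices - set(C)
        intra = similarity(W, C, C)
        if intra == 0:
            return math.inf  # a cluster with no internal edges is the worst case
        total += similarity(W, C, rest) / intra
    return total

# Hypothetical ring of 6 vertices: strong edges inside pairs, weak edges between pairs.
n = 6
W = [[0.0] * n for _ in range(n)]
for i, j, w in [(0, 1, 1.0), (2, 3, 1.0), (4, 5, 1.0),
                (1, 2, 0.1), (3, 4, 0.1), (5, 0, 0.1)]:
    W[i][j] = W[j][i] = w

good = min_max_cut(W, [[0, 1], [2, 3], [4, 5]])  # cuts only weak edges
bad = min_max_cut(W, [[1, 2], [3, 4], [5, 0]])   # cuts every strong edge
```

A lower value means the partition keeps strongly connected nodes together, so `good` should come out much smaller than `bad`.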

Mathematical Description of Spectral Clustering
Refined decision function for spectral clustering
We can further define:

Refined decision function for spectral clustering
This decision function can be solved as

Spectral Clustering Algorithm (Ng, Jordan, and Weiss)
Motivation
Given a set of points $S = \{s_1, \ldots, s_n\}$ in $\mathbb{R}^l$, we would like to cluster them into k subsets.

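Algorithm
Form the affinity matrix $A \in \mathbb{R}^{n \times n}$
Define $A_{ij} = \exp\left(-\|s_i - s_j\|^2 / 2\sigma^2\right)$ if $i \neq j$, and $A_{ii} = 0$
The scaling parameter $\sigma^2$ is chosen by the user
Define D as the diagonal matrix whose (i, i) element is the sum of A’s row i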
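The affinity and degree matrices can be built directly from their definitions: a Gaussian of the squared distance for distinct points, zero on the diagonal, and row sums of A on the diagonal of D. The following is a minimal pure-Python sketch; the function names and sample points are hypothetical, and a real implementation would more likely use an array library.

```python
import math

def affinity_matrix(points, sigma):
    """A[i][j] = exp(-||s_i - s_j||^2 / (2 sigma^2)) for i != j, and A[i][i] = 0."""
    n = len(points)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                sq_dist = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                A[i][j] = math.exp(-sq_dist / (2.0 * sigma ** 2))
    return A

def degree_matrix(A):
    """Diagonal matrix whose (i, i) entry is the sum of row i of A."""
    n = len(A)
    return [[sum(A[i]) if i == j else 0.0 for j in range(n)] for i in range(n)]

# Made-up points: two close together, one far away.
points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
A = affinity_matrix(points, sigma=1.0)
D = degree_matrix(A)
```

With `sigma=1.0` the two nearby points get an affinity of about 0.61, while the distant point's affinities are close to zero, so `sigma` controls how quickly similarity decays with distance.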

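Algorithm
Form the matrix $L = D^{-1/2} A D^{-1/2}$
Find $x_1, x_2, \ldots, x_k$, the k largest eigenvectors of L (those belonging to its k largest eigenvalues)
These form the columns of the new matrix X
Note: the dimension has been reduced from n × n to n × k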
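Forming the normalized matrix amounts to dividing each affinity entry by the square roots of the two row sums involved. The sketch below assumes a symmetric affinity matrix with no all-zero rows; the matrix values and the function name are hypothetical. The eigen-decomposition itself is not shown and would normally be delegated to a linear algebra library.

```python
import math

def normalized_affinity(A):
    """Return L = D^{-1/2} A D^{-1/2}, where D holds the row sums of A."""
    degrees = [sum(row) for row in A]
    n = len(A)
    return [
        [A[i][j] / math.sqrt(degrees[i] * degrees[j]) for j in range(n)]
        for i in range(n)
    ]

# Hypothetical symmetric affinity matrix with a zero diagonal.
A = [
    [0.0, 0.8, 0.1],
    [0.8, 0.0, 0.3],
    [0.1, 0.3, 0.0],
]
L = normalized_affinity(A)

# Sanity check: v = D^{1/2} * 1 satisfies L v = v, so 1 is an eigenvalue of L.
v = [math.sqrt(sum(row)) for row in A]
Lv = [sum(L[i][j] * v[j] for j in range(3)) for i in range(3)]
```

The check at the end relies on a general property of this normalization: the vector of square-rooted row sums is always an eigenvector of L with eigenvalue 1, which is also the largest eigenvalue when the affinities are non-negative.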

Algorithm
Form the matrix Y by renormalizing each of X’s rows to have unit length: $Y_{ij} = X_{ij} / \left(\sum_j X_{ij}^2\right)^{1/2}$
Treat each row of Y as a point in $\mathbb{R}^k$
Cluster the rows into k clusters via K-means
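The renormalization step divides every row of X by its Euclidean length, which places each embedded point on the unit sphere before K-means is run. The following is a small sketch; the matrix values and the function name are hypothetical, and leaving all-zero rows unchanged is my own choice for the degenerate case.

```python
import math

def normalize_rows(X):
    """Scale every row of X to unit Euclidean length (zero rows are left as-is)."""
    Y = []
    for row in X:
        norm = math.sqrt(sum(x * x for x in row))
        Y.append([x / norm for x in row] if norm > 0 else list(row))
    return Y

# Hypothetical n x k matrix X (n = 3 points, k = 2 eigenvectors).
X = [[3.0, 4.0], [0.5, 0.0], [-1.0, 1.0]]
Y = normalize_rows(X)
```

After this step only the direction of each row matters, so points whose rows differ only in magnitude are mapped to the same location.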

Algorithm Final Cluster Assignment
Assign the original point $s_i$ to cluster j if and only if row i of Y was assigned to cluster j

Why?
If we eventually use K-means, why not just apply K-means to the original data?
This method allows us to cluster non-convex regions

Some Examples

User’s Prerogative
Affinity matrix construction
Choice of scaling factor $\sigma^2$: realistically, search over $\sigma^2$ and pick the value that gives the tightest clusters
Choice of k, the number of clusters
Choice of clustering method

How to select k?
Eigengap: the difference between two consecutive eigenvalues.
The most stable clustering is generally given by the value of k that maximises the eigengap.
[Figure: largest eigenvalues of the Cisi/Medline data, with λ1 and λ2 marked]
Choose k = 2
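The eigengap heuristic is easy to sketch once the eigenvalues are known. The version below assumes one common convention: eigenvalues are sorted in decreasing order and k is the index at which the drop to the next eigenvalue is largest. The function name and the eigenvalues are hypothetical; they are not the Cisi/Medline values.

```python
def choose_k_by_eigengap(eigenvalues):
    """Return the k (1-based) with the largest gap between lambda_k and lambda_{k+1}."""
    lam = sorted(eigenvalues, reverse=True)
    gaps = [lam[i] - lam[i + 1] for i in range(len(lam) - 1)]
    best = max(range(len(gaps)), key=lambda i: gaps[i])
    return best + 1

# Hypothetical leading eigenvalues: two large values followed by a sharp drop.
k = choose_k_by_eigengap([1.0, 0.97, 0.35, 0.30, 0.22])
```

Here the largest drop follows the second eigenvalue, so the heuristic suggests two clusters.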

Recap – The bottom line

Summary
Spectral clustering can help us with hard clustering problems.
The technique is simple to understand.
The solution comes from solving a simple algebra problem, which is not hard to implement.
Great care should be taken in choosing the “starting conditions”.

Problems for Spectral Clustering
Number of clusters K
Objective function optimization
Better similarity (relationship) functions

What’s Visual Analytics?
Initial Clustering Result & Visualization

Initial Clustering Result & Visualization
Similarity-preserving data projection: from the high-dimensional space used for data representation to a 2D space for visualization
Data layout
Mistakes induced by data projection

Human Advising via HCI

Computer Interpretation of Human Advice
Must-link vs. not-link
Data clustering with constraints
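One simple way to turn such human advice into something a program can check is to store it as pairs of point indices and test a clustering against them. The sketch below only reports which constraints a given labeling breaks; it does not perform constrained clustering, and the function name, argument names, and data are hypothetical.

```python
def constraint_violations(labels, must_link, not_link):
    """List the pairwise constraints that a cluster labeling breaks."""
    broken = []
    for i, j in must_link:
        if labels[i] != labels[j]:
            broken.append(("must-link", i, j))  # pair was split apart
    for i, j in not_link:
        if labels[i] == labels[j]:
            broken.append(("not-link", i, j))   # pair was merged
    return broken

# Made-up labeling of five points and made-up user advice.
labels = [0, 0, 1, 1, 1]
broken = constraint_violations(
    labels,
    must_link=[(0, 1), (1, 2)],
    not_link=[(3, 4), (0, 4)],
)
```

A constrained clustering method would then try to revise the labeling until this list is empty or as short as possible.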
