Gilad Lerman Math Department, UMN

Clustering
Gilad Lerman, Math Department, UMN
Slides/figures stolen from M.-A. Dillies, E. Keogh, A. Moore

What is Clustering?
Partitioning data into classes with
high intra-class similarity
low inter-class similarity
Is it well-defined?

What is Similarity?
Clearly a subjective or problem-dependent measure

How Similar Are Clusters?
Ex1: Two clusters or one cluster?

How Similar Are Clusters?
Ex2: Cluster or outliers?

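Sum-Squares Intra-class Similarity
Given a cluster $S = \{x_1, \dots, x_N\}$
Mean: $\mu(S) = \frac{1}{N} \sum_{i=1}^{N} x_i$
Within Cluster Sum of Squares: $\mathrm{WCSS}(S) = \sum_{i=1}^{N} \|x_i - \mu(S)\|^2$
Note that [equation not preserved in the transcript]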
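Two standard identities are worth keeping in mind here; either could be what a note at this point states, though that is an assumption because the transcript lost the slide's formulas. The notation is also assumed: the cluster is $S = \{x_1, \dots, x_N\}$ and $\mu(S)$ is its mean. The first identity rewrites the within-cluster sum of squares through pairwise distances only, and the second says the mean is the best single center in the squared Euclidean sense.

```latex
\sum_{i=1}^{N} \|x_i - \mu(S)\|^2
  = \frac{1}{2N} \sum_{i=1}^{N} \sum_{j=1}^{N} \|x_i - x_j\|^2

\mu(S) = \arg\min_{c} \sum_{i=1}^{N} \|x_i - c\|^2
```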

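Within Cluster Sum of Squares
For a set of clusters S={S1,…,SK}: $\mathrm{WCSS}(S) = \sum_{k=1}^{K} \sum_{x \in S_k} \|x - \mu(S_k)\|^2$, where $\mu(S_k)$ is the mean of cluster $S_k$
Can use the $\ell_1$ norm $\|x\|_1 = \sum_j |x_j|$ instead
So get Within Clusters Manhattan Distance: $\mathrm{WCMD}(S) = \sum_{k=1}^{K} \sum_{x \in S_k} \|x - c_k\|_1$
Question: how to compute/estimate c?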
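A small sketch can make the two costs concrete. It assumes points are tuples of floats and clusters are lists of points; the helper names (`mean_point`, `median_point`, `wcss`, `wcmd`) are illustrative and not from the slides. One standard answer to the question about c, used here as the default, is the coordinate-wise median, which minimizes the sum of Manhattan distances within a cluster.

```python
from statistics import median


def mean_point(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def median_point(points):
    """Coordinate-wise median of a list of equal-length tuples."""
    return tuple(median(coord) for coord in zip(*points))


def wcss(clusters):
    """Sum over clusters of squared Euclidean distances to the cluster mean."""
    total = 0.0
    for pts in clusters:
        mu = mean_point(pts)
        total += sum(sum((a - b) ** 2 for a, b in zip(p, mu)) for p in pts)
    return total


def wcmd(clusters, centers=None):
    """Sum over clusters of Manhattan (l1) distances to a center c.

    If no centers are given, the coordinate-wise median is used.
    """
    total = 0.0
    for k, pts in enumerate(clusters):
        c = centers[k] if centers is not None else median_point(pts)
        total += sum(sum(abs(a - b) for a, b in zip(p, c)) for p in pts)
    return total


# Hypothetical toy data: two clusters in the plane.
clusters = [[(0.0, 0.0), (1.0, 0.0), (8.0, 0.0)], [(10.0, 10.0), (12.0, 10.0)]]
wcss_value = wcss(clusters)
wcmd_median = wcmd(clusters)
wcmd_mean = wcmd(clusters, centers=[mean_point(c) for c in clusters])
print(wcss_value, wcmd_median, wcmd_mean)
```

On this toy data the median centers give a smaller Manhattan cost than the mean centers, which is the motivation for pairing WCMD with K-medians.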

Minimizing WCSS
Precise minimization is “NP-hard”
Approximate minimization for WCSS by K-means
Approximate minimization for WCMD by K-medians

The K-means Algorithm
Input: data & number of clusters (K)
Randomly guess locations of K cluster centers
For each center, assign to it the data points nearest to it (its cluster)
Move each center to the mean of its cluster
Repeat till convergence ….
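The iteration can be sketched in a few lines of plain Python. This is a minimal illustration, not the code behind the slides. It assumes points are tuples of floats, that the random initial guess picks K of the data points, and that an empty cluster keeps its old center. The function name `kmeans` and its parameters are illustrative.

```python
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-means: alternate assignment and mean-update steps."""
    rng = random.Random(seed)
    # Assumed initialization: K data points chosen at random.
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing: converged
        labels = new_labels
        # Update step: each center moves to the mean of its cluster.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, labels


# Hypothetical toy data: two well-separated groups of three points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, labels = kmeans(points, 2)
print(centers, labels)
```

Replacing the mean in the update step by the coordinate-wise median, and the squared Euclidean distance by the Manhattan distance, gives the corresponding K-medians sketch.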

Demonstration: K-means/medians
Applet

K-means: Pros and Cons
Pros:
Often fast
Often terminates at a local minimum
Cons:
May not obtain the global minimum
Depends on initialization
Need to specify K
Sensitive to outliers
Sensitive to variations in sizes and densities of clusters
Not suitable for non-convex shapes
Does not apply directly to categorical data

Spectral Clustering
Idea: embed data for easy clustering
Construct a weight matrix W based on proximity
(Normalize W)
Embed using eigenvectors of W
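A minimal two-cluster sketch of this idea follows, in plain Python. Several choices are assumptions rather than the slide's method: Gaussian weights with a hand-picked `sigma`, the symmetric normalization D^{-1/2} W D^{-1/2}, power iteration to find the second eigenvector, and a sign threshold on the resulting one-dimensional embedding in place of running K-means on it. The function name `spectral_bipartition` is illustrative.

```python
import math
import random


def spectral_bipartition(points, sigma=1.0, iters=500, seed=0):
    """Split points into two groups using the second eigenvector of normalized W."""
    n = len(points)
    # Proximity weights (assumed Gaussian kernel).
    W = [[math.exp(-sum((a - b) ** 2 for a, b in zip(p, q)) / (2 * sigma ** 2))
          for q in points] for p in points]
    d = [sum(row) for row in W]  # degrees (row sums)
    # Symmetric normalization S = D^{-1/2} W D^{-1/2}.
    S = [[W[i][j] / math.sqrt(d[i] * d[j]) for j in range(n)] for i in range(n)]
    # The top eigenvector of S is proportional to sqrt(d), with eigenvalue 1.
    norm = math.sqrt(sum(d))
    v1 = [math.sqrt(x) / norm for x in d]
    rng = random.Random(seed)
    v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    for _ in range(iters):
        # Remove the component along v1, then apply S and renormalize.
        dot = sum(a * b for a, b in zip(v, v1))
        v = [a - dot * b for a, b in zip(v, v1)]
        v = [sum(S[i][j] * v[j] for j in range(n)) for i in range(n)]
        length = math.sqrt(sum(a * a for a in v))
        v = [a / length for a in v]
    # One-dimensional embedding; its sign separates the two groups.
    embedding = [v[i] / math.sqrt(d[i]) for i in range(n)]
    return [0 if e < 0 else 1 for e in embedding]


# Hypothetical toy data: two well-separated groups of three points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
labels = spectral_bipartition(points)
print(labels)
```

For more than two clusters one would typically keep several eigenvectors and cluster the embedded points, for example with K-means.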

Clustering vs. Classification
Clustering: find classes in an unsupervised way (often K is given though)
Classification: labels of clusters are given for some data points (supervised learning)

Data 1: Face images
Facial images (e.g., of persons 5, 8, 10) live on different “planes” in the “image space”
They are often well-separated so that simple clustering can apply to them (but not always…)
Question: What is the high-dimensional image space?
Question: How can we present high-dim. data in 3D?

Data 2: Iris Data Set
50 samples from each of 3 species: Setosa, Versicolor, Virginica
4 features per sample: length & width of sepal and petal
The last figure can be confusing: sepals are often green, but in iris they are not

Data 2: Iris Data Set

Data 2: Iris Data Set
Setosa is clearly separated from the 2 others
Can’t separate Virginica and Versicolor (need a training set, as done by Fisher in 1936)
Question: What are other ways to visualize?

Data 3: Color-based Compression of Images
Applet
Question: What are the actual data points?
Question: What does the error mean?

Some methods for # of Clusters (with online codes)
Gap statistic
Model-based clustering
G-means
X-means
Data-spectroscopic clustering
Self-tuning clustering
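As one illustration from this list, here is a simplified sketch of the gap statistic idea: compare log(WCSS) on the data with its average over reference data sets drawn uniformly from the data's bounding box, and pick the smallest K with Gap(K) >= Gap(K+1) - s(K+1). The implementation details are assumptions: a basic K-means with a few random restarts, uniform bounding-box references, and all function names and defaults. It also assumes every WCSS value is positive (more distinct points than clusters).

```python
import math
import random


def sqdist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_wcss(points, k, rng, restarts=5, max_iter=50):
    """Smallest WCSS found by basic K-means over several random restarts."""
    best = float("inf")
    for _ in range(restarts):
        centers = rng.sample(points, k)
        for _ in range(max_iter):
            groups = [[] for _ in range(k)]
            for p in points:
                nearest = min(range(k), key=lambda j: sqdist(p, centers[j]))
                groups[nearest].append(p)
            new_centers = [
                tuple(sum(c) / len(g) for c in zip(*g)) if g else centers[j]
                for j, g in enumerate(groups)
            ]
            if new_centers == centers:
                break
            centers = new_centers
        best = min(best, sum(min(sqdist(p, c) for c in centers) for p in points))
    return best


def gap_statistic(points, k_max, n_refs=10, seed=0):
    """Return (chosen K, dict of gap values) for K = 1..k_max."""
    rng = random.Random(seed)
    lo = [min(c) for c in zip(*points)]
    hi = [max(c) for c in zip(*points)]
    gaps, errs = {}, {}
    for k in range(1, k_max + 1):
        log_w = math.log(kmeans_wcss(points, k, rng))
        ref_logs = []
        for _ in range(n_refs):
            # Reference data: uniform in the bounding box, same size as the data.
            ref = [tuple(rng.uniform(a, b) for a, b in zip(lo, hi)) for _ in points]
            ref_logs.append(math.log(kmeans_wcss(ref, k, rng)))
        mean_ref = sum(ref_logs) / n_refs
        sd = math.sqrt(sum((x - mean_ref) ** 2 for x in ref_logs) / n_refs)
        gaps[k] = mean_ref - log_w
        errs[k] = sd * math.sqrt(1 + 1 / n_refs)
    for k in range(1, k_max):
        if gaps[k] >= gaps[k + 1] - errs[k + 1]:
            return k, gaps
    return k_max, gaps


# Hypothetical artificial data: two Gaussian blobs in the plane.
data_rng = random.Random(1)
data = ([(data_rng.gauss(0, 0.3), data_rng.gauss(0, 0.3)) for _ in range(20)]
        + [(data_rng.gauss(5, 0.3), data_rng.gauss(5, 0.3)) for _ in range(20)])
k_hat, gaps = gap_statistic(data, k_max=4)
print(k_hat, gaps)
```

Because both K-means and the reference sampling are random, the chosen K can vary with the seed, which is itself one of the scenarios worth comparing across methods.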

Your mission
Learn about clustering (theoretical results, algorithms, codes)
Focus: methods for determining # of clusters
Understand details
Compare using artificial and real data
Conclude good/bad scenarios for each (prove?)
Come up with new/improved methods
Summarize info: literature survey and possibly new/improved demos/applets
We can suggest additional questions tailored to your interest
