1
Clustering
Gilad Lerman, Math Department, UMN
Slides/figures stolen from M.-A. Dillies, E. Keogh, A. Moore
2
What is Clustering?
Partitioning data into classes with high intra-class similarity and low inter-class similarity
Is it well-defined?
3
What is Similarity?
Clearly a subjective or problem-dependent measure
4
How Similar Are Clusters?
Ex1: Two clusters or one cluster?
5
How Similar Are Clusters?
Ex2: Cluster or outliers?
6
Sum-Squares Intra-class Similarity
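Given a cluster $S_k$
Cluster Mean: $c_k = \frac{1}{|S_k|} \sum_{x \in S_k} x$
Within Cluster Sum of Squares: $\mathrm{WCSS}(S_k) = \sum_{x \in S_k} \|x - c_k\|_2^2$
Note that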
7
Within Cluster Sum of Squares
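For Set of Clusters $S = \{S_1, \dots, S_K\}$ with centers $c_1, \dots, c_K$: $\mathrm{WCSS}(S) = \sum_{k=1}^{K} \sum_{x \in S_k} \|x - c_k\|_2^2$
Can use $\|x - c_k\|_1$ instead of $\|x - c_k\|_2^2$
So get Within Clusters Manhattan Distance: $\mathrm{WCMD}(S) = \sum_{k=1}^{K} \sum_{x \in S_k} \|x - c_k\|_1$
Question: how to compute/estimate c?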
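A minimal sketch of the two objectives named on this slide, assuming NumPy and a label array that says which cluster each point belongs to. The function names `wcss` and `wcmd` are illustrative. Using the coordinate-wise median as the center for the Manhattan version is an assumption here; it is a standard choice because the median minimizes a sum of absolute deviations.

```python
import numpy as np

def wcss(points, labels):
    """Sum over clusters of squared Euclidean distances to the cluster mean."""
    total = 0.0
    for k in np.unique(labels):
        cluster = points[labels == k]
        center = cluster.mean(axis=0)
        total += ((cluster - center) ** 2).sum()
    return total

def wcmd(points, labels):
    """Sum over clusters of Manhattan distances to the coordinate-wise median
    (the median as center is an assumption of this sketch)."""
    total = 0.0
    for k in np.unique(labels):
        cluster = points[labels == k]
        center = np.median(cluster, axis=0)
        total += np.abs(cluster - center).sum()
    return total

# Toy data: two clusters of two points each
points = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [10.0, 14.0]])
labels = np.array([0, 0, 1, 1])
print(wcss(points, labels))  # 10.0
print(wcmd(points, labels))  # 6.0
```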
8
Minimizing WCSS
Precise minimization is “NP-hard”
Approximate minimization for WCSS by K-means
Approximate minimization for WCMD by K-medians
9
The K-means Algorithm
Input: Data & number of clusters (K)
Randomly guess locations of K cluster centers
For each center, assign the points nearest to it as its cluster
Move each center to the mean of its cluster
Repeat till convergence ….
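The steps on this slide can be sketched as follows, assuming NumPy. This is an illustrative version of the standard iteration, not the code behind the demonstration applet; the names `kmeans`, `n_iter` and `seed` are hypothetical, and initializing the centers at randomly chosen data points is one assumption among several possible ones.

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initial guess (assumption): k distinct data points chosen at random
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster of its nearest center
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center moves to the mean of its cluster
        # (an empty cluster keeps its previous center)
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # centers stopped moving
        centers = new_centers
    return centers, labels

# Two well-separated synthetic blobs of 50 points each
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
                  rng.normal(5.0, 0.5, size=(50, 2))])
centers, labels = kmeans(data, k=2)
print(centers)
```

On data like this the two recovered centers should land near (0, 0) and (5, 5), in either order.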
10
Demonstration: K-means/medians
Applet
11
K-means: Pros and Cons
Pros:
Often fast
Often terminates at a local minimum
Cons:
May not obtain the global minimum
Depends on initialization
Need to specify K
Sensitive to outliers
Sensitive to variations in sizes and densities of clusters
Not suitable for non-convex shapes
Does not apply directly to categorical data
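A small one-dimensional illustration of the "sensitive to outliers" point, using only the Python standard library. The numbers are made up for the example. A single far-away point drags the mean (the K-means center) a long way, while the median (the K-medians center) barely moves.

```python
from statistics import mean, median

cluster = [1.0, 2.0, 3.0, 4.0]
with_outlier = cluster + [100.0]  # one made-up outlier

print(mean(cluster), median(cluster))            # 2.5 2.5
print(mean(with_outlier), median(with_outlier))  # 22.0 3.0
```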
12
Spectral Clustering
Idea: embed data for easy clustering
Construct weights based on proximity: (Normalize W)
Embed using eigenvectors of W
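A rough sketch of the embedding idea, assuming NumPy. The slide's weight formula is not available here, so the Gaussian kernel with width `sigma` and the symmetric normalization D^{-1/2} W D^{-1/2} are both assumptions of this example, as are the function and parameter names.

```python
import numpy as np

def spectral_embedding(points, n_components, sigma=1.0):
    # Pairwise squared Euclidean distances
    sq = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=2)
    # Assumed proximity weights: Gaussian kernel, no self-loops
    W = np.exp(-sq / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # Assumed normalization: D^{-1/2} W D^{-1/2}, D = diagonal of row sums
    # (requires every point to have at least one non-negligible neighbor)
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    W_norm = W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    # eigh returns eigenvalues in ascending order; keep the top eigenvectors
    _, eigvecs = np.linalg.eigh(W_norm)
    return eigvecs[:, -n_components:]

# Two tight synthetic groups, far apart
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0.0, 0.3, size=(20, 2)),
                    rng.normal(10.0, 0.3, size=(20, 2))])
emb = spectral_embedding(points, n_components=2)
# Normalize rows to unit length before comparing directions
rows = emb / np.linalg.norm(emb, axis=1, keepdims=True)
print(rows[0] @ rows[1], rows[0] @ rows[-1])
```

After the embedding, points from the same group point in nearly the same direction and points from different groups are nearly orthogonal, so a simple method such as K-means can separate the embedded rows.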
13
Clustering vs. Classification
Clustering – find classes in an unsupervised way (often K is given though)
Classification – labels of clusters are given for some data points (supervised learning)
14
Data 1: Face images
Facial images (e.g., of persons 5, 8, 10) live on different “planes” in the “image space”
They are often well-separated so that simple clustering can apply to them (but not always…)
Question: What is the high-dimensional image space?
Question: How can we present high-dim. data in 3D?
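One common option for the last question, sketched under assumptions: treat each image as a long vector of pixel intensities and project onto the top three principal directions (PCA). The random array below is only a stand-in for real face images, and the 64x64 size and the name `pca_project` are hypothetical.

```python
import numpy as np

def pca_project(data, n_components=3):
    # Center the data, then project onto the leading principal directions
    centered = data - data.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing singular value
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

rng = np.random.default_rng(0)
# Stand-in for 30 grayscale images of 64x64 pixels, each flattened to a
# vector in a 4096-dimensional "image space"
images = rng.random((30, 64 * 64))
coords = pca_project(images)
print(coords.shape)  # (30, 3)
```

The three columns of `coords` can then be shown in a 3D scatter plot.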
15
Data 2: Iris Data Set
Setosa, Versicolor, Virginica
The last figure can be confusing: sepals are often green, but in irises they are not
50 samples from each of 3 species
4 features per sample: length & width of sepal and petal
16
Data 2: Iris Data Set
17
Data 2: Iris Data Set
Setosa is clearly separated from the 2 others
Can’t separate Virginica and Versicolor (need training set as done by Fisher in 1936)
Question: What are other ways to visualize?
18
Data 3: Color-based Compression of Images
Applet
Question: What are the actual data points?
Question: What does the error mean?
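A sketch of how color-based compression is commonly set up, assuming NumPy; it is not the applet's code. Here each data point is one pixel's RGB vector, the image is compressed by replacing every pixel with its nearest K-means center, and the error is taken to be the mean squared color difference. Both the choice of error and the names `quantize_colors`, `n_iter` and `seed` are assumptions, and the random array is a stand-in for a real image.

```python
import numpy as np

def quantize_colors(image, k, n_iter=20, seed=0):
    # Data points: one RGB vector per pixel
    pixels = image.reshape(-1, 3).astype(float)
    rng = np.random.default_rng(seed)
    centers = pixels[rng.choice(len(pixels), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each pixel to its nearest color center
        dists = ((pixels[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean color of its pixels
        for j in range(k):
            if np.any(labels == j):
                centers[j] = pixels[labels == j].mean(axis=0)
    # Final assignment against the final centers
    dists = ((pixels[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    compressed = centers[labels].reshape(image.shape)
    # Assumed error measure: mean squared distance between original and new colors
    error = dists[np.arange(len(pixels)), labels].mean()
    return compressed, error

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(16, 16, 3))  # stand-in for a real RGB image
compressed, error = quantize_colors(image, k=4)
print(compressed.shape, error)
```

The compressed image uses at most `k` distinct colors, so it can be stored as a small palette plus one index per pixel.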
19
Some methods for # of Clusters (with online codes)
Gap statistic
Model-based clustering
G-means
X-means
Data-spectroscopic clustering
Self-tuning clustering
20
Your mission
Learn about clustering (theoretical results, algorithms, codes)
Focus: methods for determining # of clusters
Understand details
Compare using artificial and real data
Conclude good/bad scenarios for each (prove?)
Come up with new/improved methods
Summarize info: literature survey and possibly new/improved demos/applets
We can suggest additional questions tailored to your interest