Clustering: Unsupervised learning introduction (Machine Learning)
Supervised learning. Training set: {(x^(1), y^(1)), (x^(2), y^(2)), …, (x^(m), y^(m))}; each example comes with a label y^(i).
Unsupervised learning. Training set: {x^(1), x^(2), …, x^(m)}; no labels are given.
Applications of clustering: market segmentation; organizing computing clusters; social network analysis; astronomical data analysis. Image credit: NASA/JPL-Caltech/E. Churchwell (Univ. of Wisconsin, Madison)
Clustering variants
Clustering categories. Based on the algorithm used, clustering methods fall into four major categories:
Partitional (centroid-based): tries to partition the data into k clusters. Examples: K-Means, K-Means++, Fuzzy C-Means.
Hierarchical: Agglomerative (start with each data point as its own cluster and repeatedly merge) or Divisive (start with the entire data set as a single cluster and repeatedly split).
Distribution-based: the clustering model most closely related to statistics, based on distribution models. Example: EM clustering. Less popular because it tends to overfit.
Density-based: clusters are defined as areas of higher density than the remainder of the data set.
Based on the type of data, clustering is categorized into: numerical data clustering and categorical data clustering.
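As an illustration of the hierarchical (agglomerative) category described above, here is a minimal single-linkage sketch in Python with NumPy. The function name and parameters are illustrative, not from the lecture, and the naive O(m^3) pairwise search is for clarity only:

```python
import numpy as np

def agglomerative(X, n_clusters):
    """Naive single-linkage agglomerative clustering sketch (illustrative only).

    X          : (m, n) data matrix
    n_clusters : number of clusters to stop at
    """
    clusters = [[i] for i in range(len(X))]  # start: each point is its own cluster
    while len(clusters) > n_clusters:
        best = None
        # Find the pair of clusters with the smallest minimum inter-point distance
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])      # merge the closest pair of clusters
        clusters.pop(b)
    return clusters
```

Running it top-down instead (divisive) would start from one cluster containing all points and split; the agglomerative direction shown here is usually simpler to implement.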
K-means algorithm
K-means algorithm. Input: K (number of clusters); training set {x^(1), x^(2), …, x^(m)}, x^(i) ∈ R^n (by convention, drop the x_0 = 1 term).
K-means algorithm:
Randomly initialize K cluster centroids μ_1, μ_2, …, μ_K ∈ R^n
Repeat {
  for i = 1 to m:
    c^(i) := index (from 1 to K) of cluster centroid closest to x^(i)
  for k = 1 to K:
    μ_k := average (mean) of points assigned to cluster k
}
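The steps above can be sketched in Python with NumPy; the function name, the seed parameter, and the choice of initializing centroids to random training examples are assumptions of this sketch, not part of the slide:

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Minimal K-means sketch. X is an (m, n) data matrix, K the number of clusters."""
    rng = np.random.default_rng(seed)
    # Randomly initialize centroids to K distinct training examples
    mu = X[rng.choice(len(X), size=K, replace=False)].copy()
    for _ in range(n_iters):
        # Cluster-assignment step: c[i] = index of the centroid closest to x^(i)
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)  # (m, K)
        c = dists.argmin(axis=1)
        # Move-centroid step: mu_k = mean of the points assigned to cluster k
        for k in range(K):
            if np.any(c == k):  # skip clusters that ended up empty
                mu[k] = X[c == k].mean(axis=0)
    return c, mu
```

At convergence each point is assigned to its nearest centroid and each centroid sits at the mean of its assigned points, which is exactly the fixed point of the two repeated steps.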
K-means for non-separated clusters. Example: T-shirt sizing (plot of Weight vs. Height).
Optimization objective
K-means optimization objective:
c^(i) = index of the cluster (1, 2, …, K) to which example x^(i) is currently assigned
μ_k = cluster centroid k (μ_k ∈ R^n)
μ_{c^(i)} = cluster centroid of the cluster to which example x^(i) has been assigned
Optimization objective:
J(c^(1), …, c^(m), μ_1, …, μ_K) = (1/m) Σ_{i=1}^{m} ||x^(i) − μ_{c^(i)}||²
min over c^(1), …, c^(m), μ_1, …, μ_K of J(c^(1), …, c^(m), μ_1, …, μ_K)
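The distortion J defined above translates directly into code; this sketch assumes the array names shown (X for the data, c for the assignments, mu for the centroids):

```python
import numpy as np

def distortion(X, c, mu):
    """Distortion J = (1/m) * sum_i ||x^(i) - mu_{c^(i)}||^2.

    X  : (m, n) data matrix
    c  : (m,) array of cluster indices c^(i)
    mu : (K, n) array of cluster centroids
    """
    return np.sum((X - mu[c]) ** 2) / len(X)  # mu[c] picks mu_{c^(i)} for each row
```

Note that J is just the average squared distance from each point to the centroid of the cluster it is assigned to.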
K-means algorithm, restated in terms of J:
Randomly initialize K cluster centroids μ_1, μ_2, …, μ_K ∈ R^n
Repeat {
  for i = 1 to m:
    c^(i) := index (from 1 to K) of cluster centroid closest to x^(i)   (minimizes J with respect to c^(1), …, c^(m), holding μ_1, …, μ_K fixed)
  for k = 1 to K:
    μ_k := average (mean) of points assigned to cluster k   (minimizes J with respect to μ_1, …, μ_K, holding c^(1), …, c^(m) fixed)
}
Random initialization
K-means algorithm (recap). The first step, randomly initializing the cluster centroids μ_1, …, μ_K, is the subject of this section; the cluster-assignment and move-centroid steps then repeat as before.
Random initialization. Should have K < m. Randomly pick K training examples, and set μ_1, …, μ_K equal to these K examples.
Local optima: depending on the random initialization, K-means can converge to different local optima of the distortion J, some much worse than others.
Random initialization (multiple restarts):
For i = 1 to 100 {
  Randomly initialize K-means.
  Run K-means; get c^(1), …, c^(m), μ_1, …, μ_K.
  Compute cost function (distortion) J(c^(1), …, c^(m), μ_1, …, μ_K)
}
Pick the clustering that gave the lowest cost J.
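The restart loop can be sketched as follows; the helper and function names are illustrative, and the inner K-means run is a compact version of the algorithm from earlier in the section:

```python
import numpy as np

def kmeans_once(X, K, rng, n_iters=50):
    """One K-means run; centroids initialized to K random training examples."""
    mu = X[rng.choice(len(X), size=K, replace=False)].copy()
    for _ in range(n_iters):
        c = np.linalg.norm(X[:, None] - mu[None], axis=2).argmin(axis=1)  # assignment step
        for k in range(K):
            if np.any(c == k):
                mu[k] = X[c == k].mean(axis=0)                            # move-centroid step
    return c, mu

def kmeans_best_of(X, K, n_restarts=100, seed=0):
    """Run K-means n_restarts times; keep the clustering with the lowest distortion J."""
    rng = np.random.default_rng(seed)
    best_J, best_c, best_mu = np.inf, None, None
    for _ in range(n_restarts):
        c, mu = kmeans_once(X, K, rng)
        J = np.sum((X - mu[c]) ** 2) / len(X)  # distortion of this run
        if J < best_J:
            best_J, best_c, best_mu = J, c, mu
    return best_J, best_c, best_mu
```

Because each restart can land in a different local optimum, keeping the run with the lowest J is what protects against a single unlucky initialization.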
Choosing the number of clusters
What is the right value of K?
Choosing the value of K. Elbow method: plot the cost function J against the number of clusters K and choose the K at the "elbow" of the curve. (Slide shows two plots of cost function vs. no. of clusters.)
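Computing the elbow curve is just running K-means (with restarts) for each candidate K and recording the resulting distortion; this sketch assumes illustrative names and reuses a compact K-means helper:

```python
import numpy as np

def kmeans(X, K, rng, n_iters=50):
    """One K-means run; centroids initialized to K random training examples."""
    mu = X[rng.choice(len(X), size=K, replace=False)].copy()
    for _ in range(n_iters):
        c = np.linalg.norm(X[:, None] - mu[None], axis=2).argmin(axis=1)
        for k in range(K):
            if np.any(c == k):
                mu[k] = X[c == k].mean(axis=0)
    return c, mu

def elbow_curve(X, K_max=6, n_restarts=10, seed=0):
    """Best-of-n_restarts distortion J for each K = 1 .. K_max."""
    rng = np.random.default_rng(seed)
    Js = []
    for K in range(1, K_max + 1):
        J_best = np.inf
        for _ in range(n_restarts):
            c, mu = kmeans(X, K, rng)
            J_best = min(J_best, np.sum((X - mu[c]) ** 2) / len(X))
        Js.append(J_best)
    return Js
```

Plotting Js against K = 1, 2, …, K_max gives the elbow curve: J drops sharply while K is below the natural number of clusters and flattens out afterwards, though on real data the elbow is often ambiguous.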
Choosing the value of K. Sometimes you're running K-means to get clusters for some later/downstream purpose. In that case, evaluate K-means based on a metric for how well it performs for that later purpose. E.g., T-shirt sizing (two Weight vs. Height plots for different choices of K).