Unsupervised Learning
Compiled by: Raj Gaurang Tiwari, Assistant Professor, SRMGPC, Lucknow
CS583, Bing Liu, UIC
Supervised learning vs. unsupervised learning
Supervised learning: discover patterns in the data that relate the data attributes to a target (class) attribute. These patterns are then used to predict the values of the target attribute in future data instances.
Unsupervised learning: the data have no target attribute. We want to explore the data to find some intrinsic structure in them.
Unsupervised learning
Unsupervised clustering algorithms aim to create groups or subsets of the data such that data points belonging to the same cluster are as similar to each other as possible, while the differences between clusters are as large as possible. As a simple example, imagine clustering customers by their demographics. The learning algorithm may help you discover distinct groups of customers by region, age range, gender and other attributes, in such a way that targeted marketing programs can be developed for each group.
What is clustering?
Clustering is a technique for finding similarity groups in data, called clusters. That is, it groups data instances that are similar to (near) each other into one cluster, and data instances that are very different (far away) from each other into different clusters. Put another way, it is the process of grouping a set of patterns into classes of similar objects: patterns within a cluster should be similar, and patterns from different clusters should be dissimilar.
Hard vs. soft clustering
Hard clustering: each pattern belongs to exactly one cluster. This is more common and easier to do.
Soft clustering: a pattern can belong to more than one cluster.
Aspects of clustering
A clustering algorithm: partitional clustering, hierarchical clustering, …
A distance (similarity, or dissimilarity) function.
Clustering quality: inter-cluster distance is maximized and intra-cluster distance is minimized.
The quality of a clustering result depends on the algorithm, the distance function, and the application.
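As a rough illustration of the quality criterion, the sketch below compares distances inside clusters with the distance between clusters. It assumes Euclidean distance, and the helper names (`mean_intra_cluster_distance`, `centroid`) and the toy data are made up for this example; other quality measures are equally valid.

```python
import math
from itertools import combinations

def mean_intra_cluster_distance(cluster):
    """Average pairwise Euclidean distance between the points of one cluster."""
    pairs = list(combinations(cluster, 2))
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)

def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))

# Two hypothetical, well-separated clusters in 2-D.
cluster_a = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
cluster_b = [(10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]

intra_a = mean_intra_cluster_distance(cluster_a)
intra_b = mean_intra_cluster_distance(cluster_b)
# Inter-cluster distance measured here between centroids (one possible choice).
inter_ab = math.dist(centroid(cluster_a), centroid(cluster_b))

print("intra A:", intra_a, "intra B:", intra_b, "inter A-B:", inter_ab)
```

A good clustering of this data shows small intra-cluster values and a much larger inter-cluster value.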
How do we define “similarity”?
Recall that the goal is to group together “similar” data, but what does this mean? There is no single answer: it depends on what we want to find or emphasize in the data. This is one reason why clustering is an “art”. The similarity measure is often more important than the clustering algorithm used, so don’t overlook this choice!
(Dis)similarity measures
Instead of talking about similarity measures, we often equivalently refer to dissimilarity measures (I’ll give an example of how to convert between them in a few slides…). Jagota defines a dissimilarity measure as a function f(x,y) such that f(x,y) > f(w,z) if and only if x is less similar to y than w is to z. This is always a pair-wise measure. Think of x, y, w, and z as gene expression profiles (rows or columns).
Euclidean distance
$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
Here n is the number of dimensions in the data vector.
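A minimal sketch of Euclidean distance in Python: the square root of the sum of squared coordinate differences over the n dimensions. The function name `euclidean_distance` is chosen for this example only.

```python
import math

def euclidean_distance(x, y):
    """Euclidean distance between two equal-length numeric vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of dimensions")
    # Sum the squared differences over all n dimensions, then take the root.
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))  # 5.0
```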
Desirable properties of a clustering algorithm
Ability to deal with different data types
Minimal requirements for domain knowledge to determine input parameters
Ability to deal with noise and outliers
Insensitivity to the order of input records
Incorporation of user-specified constraints
Interpretability and usability
K-means clustering
K-means is a partitional clustering algorithm. Let the set of data points (or instances) be $D = \{x_1, x_2, \ldots, x_n\}$, where $x_i = (x_{i1}, x_{i2}, \ldots, x_{ir})$ is a vector in a real-valued space $X \subseteq \mathbb{R}^r$, and r is the number of attributes (dimensions) in the data. The k-means algorithm partitions the given data into k clusters. Each cluster has a cluster center, called the centroid. The value of k is specified by the user.
K-means algorithm
Given k, the k-means algorithm works as follows:
1) Randomly choose k data points (seeds) to be the initial centroids (cluster centers).
2) Assign each data point to the closest centroid.
3) Re-compute the centroids using the current cluster memberships.
4) If a convergence criterion is not met, go to 2).
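The four steps above can be sketched in plain Python as follows. This is an illustrative sketch, not a reference implementation: it assumes Euclidean distance, stops when no point changes cluster or after `max_iters` iterations, and keeps the old centroid if a cluster becomes empty. The names `kmeans`, `max_iters` and `seed`, and the toy data, are assumptions made for this example.

```python
import math
import random

def kmeans(points, k, max_iters=100, seed=0):
    """Return (centroids, labels) for a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Step 1: pick k data points at random as the initial centroids.
    centroids = [tuple(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iters):
        # Step 2: assign each point to its closest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Step 4: stop when no point was re-assigned.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: recompute each centroid as the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels

# Hypothetical data: two visually obvious groups in 2-D.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(centroids, labels)
```

Because the seeds are random, which cluster receives label 0 or 1 depends on `seed`; only the grouping itself is meaningful.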
K-means algorithm (cont.)
Stopping/convergence criterion
1. no (or minimum) re-assignments of data points to different clusters,
2. no (or minimum) change of centroids, or
3. a fixed number of iterations.
Example
Strengths and weaknesses
Strengths:
Simple: easy to understand and to implement.
Efficient: the time complexity is O(tkn), where n is the number of data points, k is the number of clusters, and t is the number of iterations. When k and t are small compared with n, the running time is effectively linear in n.
Weaknesses
The user needs to specify k.
The algorithm is sensitive to outliers. Outliers are data points that are very far away from other data points. They could be errors in the data recording or special data points with very different values.
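A small sketch of why outliers matter: a centroid is a mean, so a single distant point can drag it far from the rest of the cluster. The data and the helper name `centroid` are made up for illustration.

```python
def centroid(points):
    """Coordinate-wise mean of a list of equal-length numeric tuples."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

cluster = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
with_outlier = cluster + [(100.0, 100.0)]  # one hypothetical outlier

print(centroid(cluster))       # (2.0, 2.0)
print(centroid(with_outlier))  # (26.5, 26.5)
```

With the outlier included, the centroid no longer lies near any of the three original points.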
Weaknesses of k-means: Problems with outliers
Hierarchical clustering
Produces a nested sequence of clusters, a tree, also called a dendrogram.
Hierarchical clustering (cont.)
This produces a binary tree, or dendrogram. The final cluster is the root and each data item is a leaf. The height of the bars indicates how close the items are.
Types of hierarchical clustering
Agglomerative (bottom-up) clustering: builds the dendrogram (tree) from the bottom level by repeatedly merging the most similar (or nearest) pair of clusters. It stops when all the data points are merged into a single cluster (i.e., the root cluster).
Divisive (top-down) clustering: starts with all data points in one cluster, the root, and splits the root into a set of child clusters. Each child cluster is recursively divided further. It stops when only singleton clusters remain, i.e., each cluster contains only a single data point.
Agglomerative clustering
It is more popular than divisive methods. At the beginning, each data point forms a cluster (also called a node). Merge the nodes/clusters that have the least distance between them, and go on merging. Eventually all nodes belong to one cluster.
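The merging procedure described above can be sketched as follows. The slide does not say how the distance between two clusters is measured, so this sketch assumes single-link distance (the smallest distance between any two members) with Euclidean distance between points. The function names and the 1-D toy data are hypothetical.

```python
import math

def single_link(a, b):
    """Cluster distance = smallest distance between any two members (assumed)."""
    return min(math.dist(p, q) for p in a for q in b)

def agglomerative(points):
    """Return the sequence of merges as (cluster_a, cluster_b, distance)."""
    clusters = [[p] for p in points]  # each point starts as its own cluster
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the least distance.
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]),
        )
        d = single_link(clusters[i], clusters[j])
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        # Replace the two merged clusters with their union.
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return merges

points = [(0.0,), (1.0,), (5.0,), (6.0,), (20.0,)]
for a, b, d in agglomerative(points):
    print(a, "+", b, "at distance", d)
```

The recorded merge distances correspond to the heights at which branches join in the dendrogram. This brute-force version recomputes all pairwise cluster distances at every step, which is fine for a handful of points but slow for large data sets.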
Agglomerative clustering algorithm
An example: working of the algorithm