Machine Learning on Data Lecture 9b- Clustering

CS 795/895 Introduction to Data Science
Dr. Sampath Jayarathna, Cal Poly Pomona
Credit for some of the slides in this lecture goes to Prof. Ray Mooney at UT Austin.

Take-away today
What is clustering?
Applications of clustering in information retrieval
K-means algorithm
Evaluation of clustering
How many clusters?

Clustering: Definition
(Document) clustering is the process of grouping a set of documents into clusters of similar documents.
Documents within a cluster should be similar.
Documents from different clusters should be dissimilar.
Clustering is the most common form of unsupervised learning.
Unsupervised = there are no labeled or annotated data.

Data set with clear cluster structure
How would you design an algorithm for finding these three clusters?

Classification vs. Clustering
Classification: supervised learning
Clustering: unsupervised learning
Classification: Classes are human-defined and part of the input to the learning algorithm.
Clustering: Clusters are inferred from the data without human input.
However, there are many ways of influencing the outcome of clustering: number of clusters, similarity measure, representation of documents, ...

Clustering in IR
Result set clustering for better navigation
Yippy (formerly Clusty): for grouping search results thematically

For improving search recall (Sec. 16.1)
Cluster hypothesis: documents in the same cluster behave similarly with respect to relevance to information needs.
Therefore, to improve search recall: when a query matches a doc D, also return other docs in the cluster containing D.
The hope is that the query “car” will then also return docs containing “automobile”, because clustering grouped together docs containing “car” with those containing “automobile”.
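
The recall idea above can be sketched in a few lines. This is only an illustration under stated assumptions: the document ids, the cluster names, and the helper `expand_with_clusters` are all hypothetical, and the clustering is assumed to have been computed beforehand.

```python
def expand_with_clusters(matched_docs, cluster_of, members_of):
    """Return the matched docs followed by the other docs from their clusters."""
    results = list(matched_docs)
    seen = set(results)
    for doc in matched_docs:
        for other in members_of[cluster_of[doc]]:
            if other not in seen:
                seen.add(other)
                results.append(other)
    return results

# Hypothetical toy collection: d1 and d2 contain "car", d3 only "automobile".
members_of = {"vehicles": ["d1", "d2", "d3"], "cooking": ["d4", "d5"]}
cluster_of = {doc: name for name, docs in members_of.items() for doc in docs}

matched = ["d1", "d2"]  # docs that literally contain the query term "car"
expanded = expand_with_clusters(matched, cluster_of, members_of)
print(expanded)  # ['d1', 'd2', 'd3']
```

The original matches stay at the front of the list, so the expansion can only add documents, which is why it targets recall and not precision.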

Issues for Clustering (Sec. 16.1)
Representation for clustering
Document representation: vector space? Normalization?
Need a notion of similarity/distance
How many clusters?
Completely data driven?
Avoid “trivial” clusters, too large or too small
In an application, if a cluster is too large, then for navigation purposes you've wasted an extra user click without whittling down the set of documents much.

Flat vs. Hierarchical clustering
Flat algorithms
Usually start with a random (partial) partitioning of docs into groups
Refine iteratively
Main algorithm: K-means clustering
Hierarchical algorithms
Create a hierarchy
Bottom-up, agglomerative
Top-down, divisive

Hard vs. Soft clustering
Hard clustering: Each document belongs to exactly one cluster. More common and easier to do.
Soft clustering: A document can belong to more than one cluster. Makes more sense for applications like creating browsable hierarchies.
You may want to put sneakers in two clusters: sports apparel and shoes.

Flat algorithms
Flat algorithms compute a partition of N documents into a set of K clusters.
Given: a set of documents and the number K
Find: a partition into K clusters that optimizes the chosen partitioning criterion
Global optimization: exhaustively enumerate partitions, pick the optimal one (not tractable beyond very small collections)
Effective heuristic method: K-means algorithm
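
To see why exhaustive enumeration only works for tiny inputs, here is a minimal sketch that tries every assignment of N points to K clusters (K^N candidates). The partitioning criterion is assumed to be the residual sum of squares (RSS) around each cluster mean, and the eight 2-D points are made up for illustration.

```python
from itertools import product
import numpy as np

def rss_of(X, labels, k):
    """Sum of squared distances of every point to the mean of its cluster."""
    total = 0.0
    for j in range(k):
        members = X[labels == j]
        if len(members) > 0:
            total += float(((members - members.mean(axis=0)) ** 2).sum())
    return total

def exhaustive_best(X, k):
    """Try all k**N assignments and keep the one with the lowest RSS."""
    best_rss, best_labels = float("inf"), None
    for assignment in product(range(k), repeat=len(X)):
        if len(set(assignment)) < k:
            continue  # skip assignments that leave a cluster empty
        labels = np.array(assignment)
        score = rss_of(X, labels, k)
        if score < best_rss:
            best_rss, best_labels = score, labels
    return best_rss, best_labels

# Hypothetical data: two well-separated groups of four points each.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [10, 10], [10, 11], [11, 10], [11, 11]], dtype=float)
best_rss, best_labels = exhaustive_best(X, 2)
print(best_rss, best_labels)
```

With N = 8 and K = 2 this is only 256 candidates, but the count grows as K^N, which is why a heuristic such as K-means is used in practice.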

K-means (Hard, flat clustering)
Perhaps the best known clustering algorithm
Simple, works well in many cases
Use as default / baseline for clustering documents
Vector space model: as in vector space classification, we measure relatedness between vectors by Euclidean distance, which is almost equivalent to cosine similarity. For length-normalized vectors the two give the same ranking; the "almost" is because centroids are generally not length-normalized.
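
The relationship between Euclidean distance and cosine similarity can be checked numerically: for vectors scaled to unit length, the squared Euclidean distance equals 2 × (1 − cosine similarity). The sketch below uses two made-up term-weight vectors; the helper names are illustrative only.

```python
import numpy as np

def unit(v):
    """Scale a vector to length 1 (L2 normalization)."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

def squared_euclidean(a, b):
    return float(np.sum((a - b) ** 2))

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical term-weight vectors for two documents.
d1 = unit([3.0, 1.0, 0.0])
d2 = unit([1.0, 2.0, 2.0])

lhs = squared_euclidean(d1, d2)
rhs = 2.0 * (1.0 - cosine(d1, d2))
print(lhs, rhs)  # the two values agree up to floating-point error
```

The identity only holds for unit-length vectors, so it does not carry over to centroids unless they are re-normalized.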

K-means
Each cluster in K-means is defined by a centroid.
Objective/partitioning criterion: minimize the average squared difference from the centroid.
Definition of centroid: $\vec{\mu}(\omega) = \frac{1}{|\omega|} \sum_{\vec{x} \in \omega} \vec{x}$, where we use ω to denote a cluster.
We try to find the minimum average squared difference by iterating two steps:
reassignment: assign each vector to its closest centroid
recomputation: recompute each centroid as the average of the vectors that were assigned to it in reassignment
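
The two steps can be sketched with NumPy as follows. This is a minimal illustration, not a reference implementation: the function names, the stopping rule (stop when no vector changes cluster), the handling of empty clusters, and the toy data are all assumptions made for this example.

```python
import numpy as np

def random_seeds(X, k, seed=0):
    """Pick k distinct data points at random as initial centroids."""
    rng = np.random.default_rng(seed)
    return X[rng.choice(len(X), size=k, replace=False)]

def kmeans(X, init, max_iter=100):
    """Alternate reassignment and recomputation until assignments are stable."""
    centroids = np.array(init, dtype=float)  # private copy of the seeds
    labels = None
    for _ in range(max_iter):
        # Reassignment: squared distance from every vector to every centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # no vector changed cluster
        labels = new_labels
        # Recomputation: each centroid becomes the mean of its vectors.
        for j in range(len(centroids)):
            members = X[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                centroids[j] = members.mean(axis=0)
    rss = float(((X - centroids[labels]) ** 2).sum())
    return labels, centroids, rss

# Hypothetical data: two well-separated groups of four points each.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [10, 10], [10, 11], [11, 10], [11, 11]], dtype=float)
labels, centroids, rss = kmeans(X, random_seeds(X, k=2))
print(labels, centroids, rss)
```

The function returns the residual sum of squares; dividing it by the number of vectors gives the average squared difference named in the objective.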

Example of k-means (figure sequence: points x1 to x7 and two cluster centers C1, C2)
Start with random cluster centers.
Identify the points that are closer to C1 than to C2.
Update C1.
Identify the points that are closer to C2 than to C1.
Identify the points that are closer to C2 than to C1, and the points that are closer to C1 than to C2.
Update C1 and C2.

K-means for Clustering
Start with a random guess of cluster centers
Determine the membership of each data point
Adjust the cluster centers

Optimality of K-means
K-means is guaranteed to converge, but we don’t know how long convergence will take!
If we don’t care about a few docs switching back and forth, then convergence is usually fast (within a small number of iterations).
However, complete convergence can take many more iterations.
Convergence does not mean that we converge to the optimal clustering! This is the great weakness of K-means.
If we start with a bad set of seeds, the resulting clustering can be horrible.

Initialization of K-means
Random seed selection is just one of many ways K-means can be initialized.
Random seed selection is not very robust: it’s easy to get a suboptimal clustering.
Better ways of computing initial centroids:
Select seeds not randomly, but using some heuristic (e.g., the document least similar to any existing mean)
Try out multiple starting points
Initialize with the results of another method (e.g., use hierarchical clustering to find good seeds)
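
The effect of bad seeds, and the "multiple starting points" remedy, can be illustrated on a toy data set. Everything here is an assumption made for the example: the data, the two hand-picked seed sets, and the helper `best_of_starts`, which simply keeps the run with the lowest residual sum of squares (RSS). In practice the starts would usually be drawn at random.

```python
import numpy as np

def kmeans(X, init, max_iter=100):
    """Plain K-means from the given initial centroids; returns labels and RSS."""
    centroids = np.array(init, dtype=float)
    labels = None
    for _ in range(max_iter):
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        for j in range(len(centroids)):
            members = X[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    rss = float(((X - centroids[labels]) ** 2).sum())
    return labels, rss

def best_of_starts(X, starts):
    """Run K-means from every start and keep the run with the lowest RSS."""
    runs = [kmeans(X, init) for init in starts]
    return min(runs, key=lambda run: run[1])

# Hypothetical data: two well-separated groups of four points each.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [10, 10], [10, 11], [11, 10], [11, 11]], dtype=float)
bad_start = X[[1, 2]]   # both seeds inside the same group
good_start = X[[0, 4]]  # one seed in each group

labels_bad, rss_bad = kmeans(X, bad_start)
labels_best, rss_best = best_of_starts(X, [bad_start, good_start])
print(rss_bad, rss_best)
```

On this data the first start converges to a clustering that mixes the two groups and has a much larger RSS, while the best of the two starts recovers the two groups. Both runs have converged; only one is the optimum.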

What Is A Good Clustering? (Sec. 16.3)
Internal criterion: A good clustering will produce high quality clusters in which:
the intra-class (that is, intra-cluster) similarity is high
the inter-class similarity is low
The measured quality of a clustering depends on both the document representation and the similarity measure used.
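
One simple way to put numbers on the internal criterion is to compare the average similarity of document pairs inside the same cluster with that of pairs in different clusters. The sketch below assumes cosine similarity and uses four made-up document vectors; other similarity measures or representations would give different values, as noted above.

```python
import numpy as np

def intra_inter_similarity(X, labels):
    """Average cosine similarity within clusters and across clusters."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = unit @ unit.T                       # all pairwise cosines
    same = labels[:, None] == labels[None, :]  # True where a pair shares a cluster
    off_diag = ~np.eye(len(X), dtype=bool)     # ignore each doc paired with itself
    intra = sims[same & off_diag].mean()
    inter = sims[~same].mean()
    return float(intra), float(inter)

# Hypothetical document vectors and a candidate clustering.
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
clusters = [0, 0, 1, 1]
intra, inter = intra_inter_similarity(docs, clusters)
print(intra, inter)
```

A clustering that satisfies the internal criterion should show a high first value and a low second value.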

External criteria for clustering quality (Sec. 16.3)
Quality measured by its ability to discover some or all of the hidden patterns or latent classes in gold standard data
Assesses a clustering with respect to ground truth; requires labeled data
Assume documents with C gold standard classes, while our clustering algorithm produces K clusters, ω1, ω2, …, ωK, where cluster ωi has ni members.

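External Evaluation of Cluster Quality (Sec. 16.3)
Simple measure: purity, the ratio between the size of the dominant class in cluster ωi and the size of cluster ωi:
$\mathrm{Purity}(\omega_i) = \frac{1}{n_i} \max_j n_{ij}$, where $n_{ij}$ is the number of members of class $c_j$ in cluster $\omega_i$.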

Example for computing purity
good_docs(ω1) = max(5,1,0) = 5
good_docs(ω2) = max(1,4,1) = 4
good_docs(ω3) = max(2,0,3) = 3
Purity(Ω) = (1/17) × (5 + 4 + 3) = 12/17 ≈ 0.71.
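
The same computation can be written against a cluster-by-class count table. The counts are the ones from the example; the table layout (rows are clusters, columns are gold classes) and the function name `purity` are choices made for this sketch.

```python
import numpy as np

# Rows are clusters ω1..ω3, columns are the three gold classes.
counts = np.array([[5, 1, 0],
                   [1, 4, 1],
                   [2, 0, 3]])

def purity(counts):
    """Sum of the dominant class count per cluster, divided by all documents."""
    counts = np.asarray(counts)
    return counts.max(axis=1).sum() / counts.sum()

p = purity(counts)
print(round(p, 2))  # 0.71
```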

Rand index
Definition: $RI = \frac{TP + TN}{TP + FP + FN + TN}$
Based on a 2x2 contingency table of all pairs of documents: TP + FN + FP + TN is the total number of pairs.
There are $\binom{N}{2} = N(N-1)/2$ pairs for N documents. Example: $\binom{17}{2} = 136$ in the o/⋄/x example.
Each pair is either positive or negative (the clustering puts the two documents in the same or in different clusters), and either “true” (correct) or “false” (incorrect): the clustering decision is correct or incorrect.

Rand Index: Example
As an example, we compute RI for the o/⋄/x example. We first compute TP + FP. The three clusters contain 6, 6, and 5 points, respectively, so the total number of “positives” or pairs of documents that are in the same cluster is:
$TP + FP = \binom{6}{2} + \binom{6}{2} + \binom{5}{2} = 15 + 15 + 10 = 40$
Of these, the x pairs in cluster 1, the o pairs in cluster 2, the ⋄ pairs in cluster 3, and the x pair in cluster 3 are true positives:
$TP = \binom{5}{2} + \binom{4}{2} + \binom{3}{2} + \binom{2}{2} = 10 + 6 + 3 + 1 = 20$
Thus, FP = 40 − 20 = 20.
How to calculate FN and TN?

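Rand measure for the o/⋄/x example
TP + FP + TN + FN = 136
TN + FN = 136 − (TP + FP) = 136 − 40 = 96
Same classes = TP + FN = $\binom{8}{2} + \binom{5}{2} + \binom{4}{2}$ = 28 + 10 + 6 = 44 (the classes x, o, ⋄ have 8, 5, and 4 members)
FN = 44 − TP = 24
TN = 96 − FN = 72
RI = (TP + TN)/(TP + FP + FN + TN) = (20 + 72)/(20 + 20 + 24 + 72) = 92/136 ≈ 0.68.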

F measure
Like Rand, but “precision” and “recall” can be weighted
P = TP/(TP + FP) = 20/40 = 0.5
R = TP/(TP + FN) = 20/44 ≈ 0.45
Fβ=1 = 2PR/(P + R) = 2TP/(2TP + FP + FN) = 40/84 ≈ 0.48
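
The pair counts, the Rand index, and the F measure for the o/⋄/x example can all be derived from the cluster-by-class count table. The sketch below assumes rows are clusters and columns are the classes x, o, ⋄ (the order implied by the true-positive pairs listed in the example); the function names are illustrative. Computed without rounding P and R first, F for β = 1 is 40/84 ≈ 0.476.

```python
from math import comb
import numpy as np

# Rows are clusters 1..3, columns are the gold classes x, o, diamond.
counts = np.array([[5, 1, 0],
                   [1, 4, 1],
                   [2, 0, 3]])

def pair_counts(counts):
    """Count document pairs as TP, FP, FN, TN from a cluster-by-class table."""
    counts = np.asarray(counts)
    total = comb(int(counts.sum()), 2)
    same_cluster = sum(comb(int(s), 2) for s in counts.sum(axis=1))  # TP + FP
    same_class = sum(comb(int(s), 2) for s in counts.sum(axis=0))    # TP + FN
    tp = sum(comb(int(c), 2) for c in counts.ravel())
    fp = same_cluster - tp
    fn = same_class - tp
    tn = total - tp - fp - fn
    return tp, fp, fn, tn

def f_measure(tp, fp, fn, beta=1.0):
    """Weighted harmonic mean of pairwise precision and recall."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)

tp, fp, fn, tn = pair_counts(counts)
ri = (tp + tn) / (tp + fp + fn + tn)
f1 = f_measure(tp, fp, fn)
print(tp, fp, fn, tn)              # 20 20 24 72
print(round(ri, 2), round(f1, 2))  # 0.68 0.48
```

Choosing `beta` greater than 1 weights recall more heavily, which penalizes false negatives (similar documents split across clusters) more than false positives.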

Evaluation Results
All 3 measures (purity, Rand index, F measure) range from 0 (really bad clustering) to 1 (perfect clustering).

How many clusters?
Number of clusters K is given in many applications.
What if there is no external constraint? Is there a “right” number of clusters?
One way to go: define an optimization criterion. Given docs, find K for which the optimum is reached.
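
As one possible instance of such a criterion (an assumption for this sketch, not something stated on the slide), the residual sum of squares can be penalized by the number of clusters: choose the K that minimizes RSS(K) + λK. The raw RSS alone cannot be used, since it never increases as K grows. The example uses made-up 1-D data, where the best partition into K groups always consists of contiguous runs of the sorted values, so it can be found by trying all cut positions; the penalty weight `lam` is hypothetical.

```python
from itertools import combinations
import numpy as np

def rss(points):
    """Sum of squared deviations of the points from their mean."""
    points = np.asarray(points, dtype=float)
    return float(((points - points.mean()) ** 2).sum())

def best_rss_1d(sorted_x, k):
    """Minimum RSS over all splits of sorted 1-D data into k contiguous groups."""
    n = len(sorted_x)
    best = float("inf")
    for cuts in combinations(range(1, n), k - 1):
        bounds = (0, *cuts, n)
        total = sum(rss(sorted_x[a:b]) for a, b in zip(bounds, bounds[1:]))
        best = min(best, total)
    return best

def choose_k(x, lam, k_max):
    """Pick the K in 1..k_max that minimizes RSS(K) + lam * K."""
    xs = sorted(x)
    scores = {k: best_rss_1d(xs, k) + lam * k for k in range(1, k_max + 1)}
    return min(scores, key=scores.get), scores

# Hypothetical 1-D data with three obvious groups; lam is an assumed penalty.
x = [1, 2, 3, 11, 12, 13, 21, 22, 23]
best_k, scores = choose_k(x, lam=5.0, k_max=6)
print(best_k, scores)
```

The result depends on the penalty weight: a larger `lam` favors fewer clusters, a smaller one favors more, so the weight itself still has to be chosen with the application in mind.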

State-of-the-art Clustering: Topic Models in Text Processing

Overview
Motivation: model the topics/subtopics in text collections
Basic assumptions:
There are k topics in the whole collection
Each topic is represented by a multinomial distribution over the vocabulary (language model)
Each document can cover multiple topics
Applications:
Summarizing topics
Predict topic coverage for documents
Model the topic correlations
Classification, clustering

Basic Topic Models
Unigram model
Mixture of unigrams
Probabilistic LSI
Latent Dirichlet Allocation (LDA)
Correlated Topic Models

What is a “topic”?
Representation: a probability distribution over words, e.g., retrieval, information 0.15, model, query, language, feedback, ……
Topic: a broad concept/theme, semantically coherent, which is hidden in documents, e.g., politics, sports, technology, entertainment, education.

Document as a mixture of topics
government 0.3 response Topic 1 [ Criticism of government response to the hurricane primarily consisted of criticism of its response to the approach of the storm and its aftermath, specifically in the delayed response ] to the [ flooding of New Orleans. … 80% of the 1.3 million residents of the greater New Orleans metropolitan area evacuated ] …[ Over seventy countries pledged monetary donations or other assistance]. … city 0.2 new orleans Topic 2 … How can we discover these topic-word distributions? Many applications would be enabled by discovering such topics Summarize themes/aspects Facilitate navigation/browsing Retrieve documents Segment documents Many other text mining tasks donate 0.1 relief 0.05 help Topic k is 0.05 the a Background k

Latent Dirichlet Allocation

Topics learned by LDA

Topic assignments in document
Based on the topics shown in the last slide

Final word
In clustering, clusters are inferred from the data without human input (unsupervised learning).
However, in practice, it’s a bit less clear: there are many ways of influencing the outcome of clustering: number of clusters, similarity measure, representation of documents.
