Advanced Multimedia: Text Clustering. Tamara Berg.
Reminder: Classification. Given some labeled training documents, determine the best label for a test (query) document.
What if we don’t have labeled data? We can’t do classification. What can we do?
– Clustering: the assignment of objects into groups (called clusters) so that objects in the same cluster are more similar to each other than to objects in different clusters.
– Often similarity is assessed according to a distance measure.
– Clustering is a common technique for statistical data analysis, used in many fields, including machine learning, data mining, pattern recognition, image analysis, and bioinformatics.
Any of the similarity metrics we talked about before can be used (e.g., sum of squared distances, angle between vectors).
Document Clustering. Clustering is the process of grouping a set of documents into clusters of similar documents: documents within a cluster should be similar, and documents from different clusters should be dissimilar.
[Slide figure omitted. Source: Hinrich Schutze]
Examples in the wild: Google News story clusters, Flickr Clusters.
How to Cluster Documents
Reminder: Vector Space Model. Documents are represented as vectors in term space. A vector distance/similarity measure between two documents is used to compare documents. Slide from Mitch Marcus
Document Vectors: one location for each word.

        nova  galaxy  heat  h'wood  film  role  diet  fur
A        10     5      3
B         5    10
C                              10     8     7
D                               9    10     5
E                                          10    10
F                                     9    10
G         5            7                    9
H                6    10        2     8
I                7     5                    1     3

“Nova” occurs 10 times in text A, “Galaxy” occurs 5 times in text A, “Heat” occurs 3 times in text A. (Blank means 0 occurrences.) Rows A through I are document ids. Slide from Mitch Marcus
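Building such vectors is mechanical; a minimal sketch (the vocabulary and helper name are illustrative, not from the slides):

```python
# Represent a document as a vector of raw term counts over a fixed
# vocabulary, matching the table above (blank = 0 occurrences).
from collections import Counter

VOCAB = ["nova", "galaxy", "heat", "h'wood", "film", "role", "diet", "fur"]

def count_vector(text):
    """Map a document's text to a list of term counts, one slot per VOCAB word."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in VOCAB]

vec = count_vector("nova nova galaxy heat nova")
# "nova" appears 3 times, "galaxy" and "heat" once each, all others 0
```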
TF x IDF Calculation. Each document A is represented by a vector of weights (w1, w2, w3, …, wn), one per vocabulary term. Slide from Mitch Marcus
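The calculation itself is not shown on the transcribed slide; a minimal sketch of the standard weighting w = tf * log(N / df), where tf is the term count in the document, N the number of documents, and df the number of documents containing the term (function and variable names are illustrative):

```python
import math

def tf_idf(term_counts_per_doc):
    """Compute w = tf * log(N / df) for every (document, term) pair.

    term_counts_per_doc: list of dicts mapping term -> raw count.
    Returns a parallel list of dicts mapping term -> tf-idf weight.
    """
    n_docs = len(term_counts_per_doc)
    # df: number of documents containing each term
    df = {}
    for counts in term_counts_per_doc:
        for term in counts:
            df[term] = df.get(term, 0) + 1
    return [
        {term: tf * math.log(n_docs / df[term]) for term, tf in counts.items()}
        for counts in term_counts_per_doc
    ]

docs = [{"nova": 10, "galaxy": 5}, {"nova": 5}, {"film": 8}]
w = tf_idf(docs)
# "nova" occurs in 2 of 3 docs, so its idf is log(3/2);
# "film" occurs in only 1 of 3 docs, so its idf is the larger log(3).
```

Terms appearing in every document get weight log(1) = 0, which is the point of the idf factor: ubiquitous words carry no discriminative information.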
Features. A document can likewise be represented by any feature vector (f1, f2, f3, …, fn). Define whatever features you like:
– Length of the longest string of CAPs
– Number of $’s
– Useful words for the task
– …
Similarity between documents.
A = [10 5 3 0 0 0 0 0]
G = [5 0 7 0 0 9 0 0]
E = [0 0 0 0 0 10 10 0]
Using the sum of squared distances (SSD):
SSD(A,G) = ?
SSD(A,E) = ?
SSD(G,E) = ?
Which pair of documents is the most similar?
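Working the exercise out in a few lines of Python (the helper name `ssd` is mine):

```python
def ssd(u, v):
    """Sum of squared differences between two term-count vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

A = [10, 5, 3, 0, 0, 0, 0, 0]
G = [5, 0, 7, 0, 0, 9, 0, 0]
E = [0, 0, 0, 0, 0, 10, 10, 0]

print(ssd(A, G))  # 147
print(ssd(A, E))  # 334
print(ssd(G, E))  # 175
# Smallest SSD wins: A and G are the most similar pair.
```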
[Slide figure omitted. Source: Dan Klein]
K-means clustering. We want to minimize the sum of squared Euclidean distances between points x_i and their nearest cluster centers m_k:

    RSS = sum over clusters k of sum over points x_i in cluster k of ||x_i - m_k||^2

source: Svetlana Lazebnik
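The standard algorithm alternates two steps, each of which can only lower this objective: assign every point to its nearest center, then recompute each center as the mean of its assigned points. A minimal pure-Python sketch (initialization by random sampling is one common choice, not prescribed by the slides):

```python
import random

def kmeans(points, k, n_iters=100, seed=0):
    """Lloyd's algorithm on a list of equal-length tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # pick k points as initial centers
    for _ in range(n_iters):
        # Assignment step: each point goes to the nearest center
        # by squared Euclidean distance.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        # Update step: move each center to the mean of its cluster
        # (keep the old center if a cluster went empty).
        new_centers = [
            tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
        if new_centers == centers:           # fixed point reached
            break
        centers = new_centers
    return centers, clusters

pts = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9)]
centers, clusters = kmeans(pts, k=2)
# The two tight groups are recovered, with centers near (0.05, 0.1) and (5.05, 4.95).
```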
Convergence of K-Means. K-means converges to a fixed point in a finite number of iterations.
Proof sketch:
– The sum of squared distances (RSS) decreases during reassignment (because each vector is moved to a closer centroid).
– RSS also decreases during recomputation (the mean of a cluster minimizes the sum of squared distances to its points).
– Since RSS decreases at every change and there are only finitely many possible clusterings, we must reach a fixed point.
But we don’t know how long convergence will take! If we don’t care about a few docs switching back and forth, then convergence is usually fast (< 10-20 iterations).
Source: Hinrich Schutze
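The monotone-decrease claim is easy to check empirically: run k-means and record RSS after each full iteration (the implementation below is my own sketch, not from the slides).

```python
import random

def sqdist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_rss_trace(points, k, n_iters=20, seed=1):
    """Run k-means, returning the RSS measured after each iteration."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    trace = []
    for _ in range(n_iters):
        # Reassignment step
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda c: sqdist(p, centers[c]))].append(p)
        # Recomputation step (keep old center if a cluster went empty)
        centers = [
            tuple(sum(d) / len(cl) for d in zip(*cl)) if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
        rss = sum(sqdist(p, centers[j]) for j, cl in enumerate(clusters) for p in cl)
        trace.append(rss)
    return trace

pts = [(0, 0), (1, 0), (0, 1), (8, 8), (9, 8), (8, 9), (4, 4)]
trace = kmeans_rss_trace(pts, k=2)
# RSS never increases from one iteration to the next.
```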
Hierarchical clustering strategies.
Agglomerative clustering:
– Start with each point in a separate cluster.
– At each iteration, merge two of the “closest” clusters.
Divisive clustering:
– Start with all points grouped into a single cluster.
– At each iteration, split the “largest” cluster.
source: Svetlana Lazebnik
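The agglomerative variant can be sketched in a few lines; here I assume single-link distance (distance between the closest pair of points across the two clusters), one common choice the slides leave open:

```python
def sqdist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def single_link(c1, c2):
    """Single-link distance: closest pair of points across two clusters."""
    return min(sqdist(p, q) for p in c1 for q in c2)

def agglomerative(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[p] for p in points]          # start: one cluster per point
    while len(clusters) > n_clusters:
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i].extend(clusters.pop(j))   # merge cluster j into cluster i
    return clusters

cs = agglomerative([(0, 0), (0, 1), (10, 10), (10, 11), (5, 5)], 2)
# The lone point (5, 5) ends up attached to its nearer group.
```

Recording the sequence of merges, rather than stopping at a fixed count, yields the usual dendrogram.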
Divisive Clustering. Top-down (instead of bottom-up, as in agglomerative clustering):
– Start with all docs in one big cluster.
– Then recursively split clusters.
– Eventually each node forms a cluster on its own.
Source: Hinrich Schutze
Flat or hierarchical clustering?
– For high efficiency, use flat clustering (e.g., k-means).
– For deterministic results, use hierarchical clustering.
– When a hierarchical structure is desired, use a hierarchical algorithm.
– Hierarchical clustering can also be applied if K cannot be predetermined (you can start without knowing K).
Source: Hinrich Schutze
For Thursday: read Chapter 6 of the textbook.