Advanced Multimedia: Text Clustering. Tamara Berg.
Reminder: Classification. Given some labeled training documents, determine the best label for a test (query) document.
What if we don’t have labeled data? We can’t do classification. What can we do?
– Clustering: the assignment of objects into groups (called clusters) so that objects in the same cluster are more similar to each other than to objects in different clusters.
– Often similarity is assessed according to a distance measure.
– Clustering is a common technique for statistical data analysis, used in many fields, including machine learning, data mining, pattern recognition, image analysis, and bioinformatics.
Any of the similarity metrics we talked about before can be used (e.g., sum of squared distances, angle between vectors).
Document Clustering. Clustering is the process of grouping a set of documents into clusters of similar documents: documents within a cluster should be similar, and documents from different clusters should be dissimilar.
[Slide figure omitted. Source: Hinrich Schutze]
Examples in the wild: Google News story clusters, Flickr Clusters.
How to Cluster Documents
Reminder: Vector Space Model. Documents are represented as vectors in term space. A vector distance/similarity measure between two documents is used to compare documents. Slide from Mitch Marcus
Document Vectors: one location for each word.

        nova  galaxy  heat  h'wood  film  role  diet  fur
A        10     5      3
B         5    10
C                              10     8     7
D                               9    10     5
E                                          10    10
F                                     9    10
G         5            7                    9
H                6    10        2     8
I                7     5                    1     3

“Nova” occurs 10 times in text A, “Galaxy” occurs 5 times in text A, “Heat” occurs 3 times in text A. (Blank means 0 occurrences.) Rows A through I are document ids. Slide from Mitch Marcus
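Building such vectors is mechanical; a minimal sketch (the vocabulary and helper name are illustrative, not from the slides):

```python
# Represent a document as a vector of raw term counts over a fixed
# vocabulary, matching the table above (blank = 0 occurrences).
from collections import Counter

VOCAB = ["nova", "galaxy", "heat", "h'wood", "film", "role", "diet", "fur"]

def count_vector(text):
    """Map a document's text to a list of term counts, one slot per VOCAB word."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in VOCAB]

vec = count_vector("nova nova galaxy heat nova")
# "nova" appears 3 times, "galaxy" and "heat" once each, all others 0
```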
TF x IDF Calculation. Each document A is represented by a vector of weights (w1, w2, w3, …, wn), one per vocabulary term. Slide from Mitch Marcus
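The calculation itself is not shown on the transcribed slide; a minimal sketch of the standard weighting w = tf * log(N / df), where tf is the term count in the document, N the number of documents, and df the number of documents containing the term (function and variable names are illustrative):

```python
import math

def tf_idf(term_counts_per_doc):
    """Compute w = tf * log(N / df) for every (document, term) pair.

    term_counts_per_doc: list of dicts mapping term -> raw count.
    Returns a parallel list of dicts mapping term -> tf-idf weight.
    """
    n_docs = len(term_counts_per_doc)
    # df: number of documents containing each term
    df = {}
    for counts in term_counts_per_doc:
        for term in counts:
            df[term] = df.get(term, 0) + 1
    return [
        {term: tf * math.log(n_docs / df[term]) for term, tf in counts.items()}
        for counts in term_counts_per_doc
    ]

docs = [{"nova": 10, "galaxy": 5}, {"nova": 5}, {"film": 8}]
w = tf_idf(docs)
# "nova" occurs in 2 of 3 docs, so its idf is log(3/2);
# "film" occurs in only 1 of 3 docs, so its idf is the larger log(3).
```

Terms appearing in every document get weight log(1) = 0, which is the point of the idf factor: ubiquitous words carry no discriminative information.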
Features. A document can likewise be represented by any feature vector (f1, f2, f3, …, fn). Define whatever features you like:
– Length of the longest string of CAPs
– Number of $’s
– Useful words for the task
– …
Similarity between documents.
A = [10 5 3 0 0 0 0 0]
G = [5 0 7 0 0 9 0 0]
E = [0 0 0 0 0 10 10 0]
Using the sum of squared distances (SSD):
SSD(A,G) = ?
SSD(A,E) = ?
SSD(G,E) = ?
Which pair of documents is the most similar?
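Working the exercise out in a few lines of Python (the helper name `ssd` is mine):

```python
def ssd(u, v):
    """Sum of squared differences between two term-count vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

A = [10, 5, 3, 0, 0, 0, 0, 0]
G = [5, 0, 7, 0, 0, 9, 0, 0]
E = [0, 0, 0, 0, 0, 10, 10, 0]

print(ssd(A, G))  # 147
print(ssd(A, E))  # 334
print(ssd(G, E))  # 175
# Smallest SSD wins: A and G are the most similar pair.
```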
[Slide figure omitted. Source: Dan Klein]
K-means clustering. We want to minimize the sum of squared Euclidean distances between points x_i and their nearest cluster centers m_k:

    RSS = sum over clusters k of sum over points x_i in cluster k of ||x_i - m_k||^2

source: Svetlana Lazebnik
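The standard algorithm alternates two steps, each of which can only lower this objective: assign every point to its nearest center, then recompute each center as the mean of its assigned points. A minimal pure-Python sketch (initialization by random sampling is one common choice, not prescribed by the slides):

```python
import random

def kmeans(points, k, n_iters=100, seed=0):
    """Lloyd's algorithm on a list of equal-length tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # pick k points as initial centers
    for _ in range(n_iters):
        # Assignment step: each point goes to the nearest center
        # by squared Euclidean distance.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        # Update step: move each center to the mean of its cluster
        # (keep the old center if a cluster went empty).
        new_centers = [
            tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
        if new_centers == centers:           # fixed point reached
            break
        centers = new_centers
    return centers, clusters

pts = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9)]
centers, clusters = kmeans(pts, k=2)
# The two tight groups are recovered, with centers near (0.05, 0.1) and (5.05, 4.95).
```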
Convergence of K-Means. K-means converges to a fixed point in a finite number of iterations.
Proof sketch:
– The sum of squared distances (RSS) decreases during reassignment (because each vector is moved to a closer centroid).
– RSS also decreases during recomputation (the mean of a cluster minimizes the sum of squared distances to its points).
– Since RSS decreases at every change and there are only finitely many possible clusterings, we must reach a fixed point.
But we don’t know how long convergence will take! If we don’t care about a few docs switching back and forth, then convergence is usually fast (< 10-20 iterations).
Source: Hinrich Schutze
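The monotone-decrease claim is easy to check empirically: run k-means and record RSS after each full iteration (the implementation below is my own sketch, not from the slides).

```python
import random

def sqdist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_rss_trace(points, k, n_iters=20, seed=1):
    """Run k-means, returning the RSS measured after each iteration."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    trace = []
    for _ in range(n_iters):
        # Reassignment step
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda c: sqdist(p, centers[c]))].append(p)
        # Recomputation step (keep old center if a cluster went empty)
        centers = [
            tuple(sum(d) / len(cl) for d in zip(*cl)) if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
        rss = sum(sqdist(p, centers[j]) for j, cl in enumerate(clusters) for p in cl)
        trace.append(rss)
    return trace

pts = [(0, 0), (1, 0), (0, 1), (8, 8), (9, 8), (8, 9), (4, 4)]
trace = kmeans_rss_trace(pts, k=2)
# RSS never increases from one iteration to the next.
```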
Hierarchical clustering strategies.
Agglomerative clustering:
– Start with each point in a separate cluster.
– At each iteration, merge two of the “closest” clusters.
Divisive clustering:
– Start with all points grouped into a single cluster.
– At each iteration, split the “largest” cluster.
source: Svetlana Lazebnik
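The agglomerative variant can be sketched in a few lines; here I assume single-link distance (distance between the closest pair of points across the two clusters), one common choice the slides leave open:

```python
def sqdist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def single_link(c1, c2):
    """Single-link distance: closest pair of points across two clusters."""
    return min(sqdist(p, q) for p in c1 for q in c2)

def agglomerative(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[p] for p in points]          # start: one cluster per point
    while len(clusters) > n_clusters:
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i].extend(clusters.pop(j))   # merge cluster j into cluster i
    return clusters

cs = agglomerative([(0, 0), (0, 1), (10, 10), (10, 11), (5, 5)], 2)
# The lone point (5, 5) ends up attached to its nearer group.
```

Recording the sequence of merges, rather than stopping at a fixed count, yields the usual dendrogram.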
Divisive Clustering. Top-down (instead of bottom-up, as in agglomerative clustering):
– Start with all docs in one big cluster.
– Then recursively split clusters.
– Eventually each node forms a cluster on its own.
Source: Hinrich Schutze
Flat or hierarchical clustering?
– For high efficiency, use flat clustering (e.g., k-means).
– For deterministic results, use hierarchical clustering.
– When a hierarchical structure is desired, use a hierarchical algorithm.
– Hierarchical clustering can also be applied if K cannot be predetermined (you can start without knowing K).
Source: Hinrich Schutze
For Thursday: read Chapter 6 of the textbook.