1 Document Clustering Carl Staelin

2 Motivation (Lecture 7, Information Retrieval and Digital Libraries)
It is hard to rapidly understand a big bucket of documents.
Humans look for patterns, and are good at pattern matching.
“Random” collections of documents don’t have a recognizable structure.
Clustering documents into recognizable groups makes it easier to see patterns.
  Irrelevant clusters can then be rapidly eliminated.

4 Basic Idea
Choose a document similarity measure.
Choose a cluster cost or similarity criterion.
Group like documents into clusters with minimal cluster cost.

5 Cluster Cost Criteria
Sum-of-squared-error: $\mathrm{Cost} = \sum_i \|x_i - \bar{x}\|^2$, where $\bar{x}$ is the mean of the points in the cluster.
Average squared distance: $\mathrm{Cost} = \frac{1}{n^2} \sum_i \sum_j \|x_i - x_j\|^2$, where $n$ is the number of points in the cluster.
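The two cost criteria above can be computed directly for a single cluster. The following is a minimal sketch, assuming points are plain lists of floats and that the unlabeled "x" on the slide is the cluster mean; the function names are illustrative and not from the lecture.

```python
from typing import List, Sequence

Point = Sequence[float]

def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance ||a - b||^2."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def mean(points: List[Point]) -> List[float]:
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def sum_squared_error(points: List[Point]) -> float:
    """Sum over i of ||x_i - mean||^2 (assumes 'x' on the slide is the mean)."""
    m = mean(points)
    return sum(sq_dist(x, m) for x in points)

def avg_squared_distance(points: List[Point]) -> float:
    """(1/n^2) * sum over all ordered pairs (i, j) of ||x_i - x_j||^2."""
    n = len(points)
    return sum(sq_dist(a, b) for a in points for b in points) / (n * n)

# Hypothetical three-point cluster, for illustration only.
cluster = [[0.0, 0.0], [2.0, 0.0], [1.0, 3.0]]
sse = sum_squared_error(cluster)
asd = avg_squared_distance(cluster)
print(sse, asd)
```

For any cluster the two quantities are related by average squared distance = 2 · SSE / n, so they rank alternative groupings of a fixed-size cluster the same way.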

6 Cluster Similarity Measure
Measures the similarity of two clusters $C_i$, $C_j$:
1. $d_{\min}(C_i, C_j) = \min_{x_i \in C_i,\, x_j \in C_j} \|x_i - x_j\|$
2. $d_{\max}(C_i, C_j) = \max_{x_i \in C_i,\, x_j \in C_j} \|x_i - x_j\|$
3. $d_{\mathrm{avg}}(C_i, C_j) = \frac{1}{n_i n_j} \sum_{x_i \in C_i} \sum_{x_j \in C_j} \|x_i - x_j\|$
4. $d_{\mathrm{mean}}(C_i, C_j) = \left\| \frac{1}{n_i} \sum_{x_i \in C_i} x_i - \frac{1}{n_j} \sum_{x_j \in C_j} x_j \right\|$
5. …
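The four measures listed above can be sketched in a few lines. This is an illustration under assumptions: clusters are lists of coordinate tuples, the norm is Euclidean, and the function names simply mirror the slide's subscripts. Note that all four are distances, so a smaller value means the clusters are more similar.

```python
import math
from itertools import product

def centroid(cluster):
    """Component-wise mean of the points in a cluster."""
    n = len(cluster)
    return tuple(sum(col) / n for col in zip(*cluster))

def d_min(ci, cj):
    """Distance between the closest pair of points, one from each cluster."""
    return min(math.dist(a, b) for a, b in product(ci, cj))

def d_max(ci, cj):
    """Distance between the farthest pair of points, one from each cluster."""
    return max(math.dist(a, b) for a, b in product(ci, cj))

def d_avg(ci, cj):
    """Average distance over all cross-cluster pairs."""
    total = sum(math.dist(a, b) for a, b in product(ci, cj))
    return total / (len(ci) * len(cj))

def d_mean(ci, cj):
    """Distance between the two cluster means."""
    return math.dist(centroid(ci), centroid(cj))

# Hypothetical clusters, for illustration only.
c1 = [(0.0, 0.0), (0.0, 2.0)]
c2 = [(3.0, 0.0), (3.0, 2.0)]
print(d_min(c1, c2), d_max(c1, c2), d_avg(c1, c2), d_mean(c1, c2))
```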

7 Iterative Clustering
Assign points to initial k clusters
  Often this is done by random assignment
Until done
  Select a candidate point x, in cluster c
  Find the “best” cluster c’ for x
  If c ≠ c’, then move x to c’
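The loop above might be sketched as follows. The slide does not say what "best" means or when to stop, so this sketch makes assumptions: the best cluster is the one with the nearest centroid, the loop stops when a full pass moves no point, and the random initial assignment is dealt round-robin so that no cluster starts empty. The names `iterative_clustering`, `seed`, and `max_passes` are hypothetical.

```python
import math
import random

def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))

def iterative_clustering(points, k, seed=0, max_passes=100):
    """Return a list of cluster labels (0..k-1), one per point."""
    rng = random.Random(seed)
    order = list(range(len(points)))
    rng.shuffle(order)
    # Random initial assignment; round-robin keeps every cluster non-empty.
    assign = [0] * len(points)
    for rank, idx in enumerate(order):
        assign[idx] = rank % k
    for _ in range(max_passes):
        moved = False
        for i, x in enumerate(points):
            # Recompute centroids so earlier moves affect later candidates.
            members = {}
            for j, c in enumerate(assign):
                members.setdefault(c, []).append(points[j])
            cents = {c: centroid(ps) for c, ps in members.items()}
            # Assumption: "best" cluster = nearest centroid.
            best = min(cents, key=lambda c: math.dist(x, cents[c]))
            if math.dist(x, cents[best]) < math.dist(x, cents[assign[i]]):
                assign[i] = best
                moved = True
        if not moved:  # "done" = a full pass with no moves
            break
    return assign

# Hypothetical data: two well-separated groups of three points.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = iterative_clustering(pts, k=2)
print(labels)
```

On this toy data the first three points end up sharing one label and the last three the other. On less well-separated data the result can depend on the initial assignment.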

8 Iterative Clustering
The user must pre-select the number of clusters.
  Often the “correct” number is not known in advance!
The quality of the outcome usually depends on the quality of the initial assignment.
  Possibly use some other algorithm to create a good initial assignment?

9 Hierarchical Agglomerative Clustering
Create N single-document clusters
For i in 1..N-1
  Merge the two clusters with greatest similarity
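A minimal sketch of the merge loop above, assuming documents are already represented as coordinate tuples and that "greatest similarity" means smallest d_min (single link). Any of the other cluster measures could be passed in instead. The function name `hac` and the stopping parameter `k` are illustrative, not from the lecture.

```python
import math
from itertools import combinations, product

def d_min(ci, cj):
    """Single-link distance: closest pair across the two clusters."""
    return min(math.dist(a, b) for a, b in product(ci, cj))

def hac(points, k=1, cluster_dist=d_min):
    """Merge the two closest clusters until only k clusters remain."""
    clusters = [[p] for p in points]  # N single-document clusters
    while len(clusters) > k:
        # Find the pair of clusters (i < j) with the smallest distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_dist(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because j > i
    return clusters

# Hypothetical data, for illustration only.
docs = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
three = hac(docs, k=3)
two = hac(docs, k=2)
print(three)
print(two)
```

This brute-force version rescans all cluster pairs on every merge, which is fine for a handful of points but slow for large collections.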

12 Hierarchical Agglomerative Clustering
Hierarchical agglomerative clustering gives a hierarchy of clusters.
This makes it easier to explore the set of possible k-cluster values to choose the best number of clusters.
[Figure: cluster hierarchy with levels labeled 3, 4, 5]
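One way to explore several values of k from a single run is to record the clustering after every merge. The sketch below assumes single-link distance and coordinate-tuple documents; `hac_levels` is a hypothetical helper name, and how to score each level to pick the "best" k is left open, as on the slide.

```python
import math
from itertools import combinations, product

def d_min(ci, cj):
    """Single-link distance: closest pair across the two clusters."""
    return min(math.dist(a, b) for a, b in product(ci, cj))

def hac_levels(points, cluster_dist=d_min):
    """Run HAC once and return {k: clustering with k clusters} for every k."""
    clusters = [[p] for p in points]
    levels = {len(clusters): [list(c) for c in clusters]}
    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_dist(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
        # Snapshot this level of the hierarchy.
        levels[len(clusters)] = [list(c) for c in clusters]
    return levels

# Hypothetical data, for illustration only.
docs = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
levels = hac_levels(docs)
for k in (3, 4, 5):
    print(k, levels[k])
```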

13 High density variations
[Figure: intuitively “correct” clustering]

14 High density variations
[Figure: intuitively “correct” clustering; HAC-generated clusters]

15 Hybrid
Combine HAC and iterative clustering
Assign points to initial clusters using HAC
Until done
  Select a candidate point x, in cluster c
  Find the “best” cluster c’ for x
  If c ≠ c’, then move x to c’
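The hybrid scheme above can be sketched by chaining the two steps. As before, this rests on assumptions the slide leaves open: single-link HAC for the initial assignment, nearest centroid as the "best" cluster, and stopping when a full pass moves nothing. The names `hac_init`, `refine`, and `hybrid` are hypothetical.

```python
import math
from itertools import combinations, product

def centroid(points):
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))

def hac_init(points, k):
    """Single-link HAC (an assumed choice) down to k clusters; returns labels."""
    clusters = [[i] for i in range(len(points))]  # clusters hold point indices

    def d(ci, cj):
        return min(math.dist(points[a], points[b]) for a, b in product(ci, cj))

    while len(clusters) > k:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: d(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    labels = [0] * len(points)
    for c, members in enumerate(clusters):
        for idx in members:
            labels[idx] = c
    return labels

def refine(points, labels, max_passes=100):
    """Iteratively move each point to the cluster with the nearest centroid."""
    labels = list(labels)
    for _ in range(max_passes):
        moved = False
        for i, x in enumerate(points):
            groups = {}
            for j, c in enumerate(labels):
                groups.setdefault(c, []).append(points[j])
            cents = {c: centroid(ps) for c, ps in groups.items()}
            best = min(cents, key=lambda c: math.dist(x, cents[c]))
            if math.dist(x, cents[best]) < math.dist(x, cents[labels[i]]):
                labels[i] = best
                moved = True
        if not moved:
            break
    return labels

def hybrid(points, k):
    return refine(points, hac_init(points, k))

# Hypothetical 1-D chain of points, for illustration only.
chain = [(0,), (1,), (2,), (3,), (4,), (5,), (6,), (7.5,)]
initial = hac_init(chain, 2)
final = hybrid(chain, 2)
print(initial)
print(final)
```

On this chain, single-link HAC isolates only the last point, and the refinement pass then pulls the two neighboring points into its cluster, giving a more balanced split.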

16 Other Algorithms
Support Vector Clustering
Information Bottleneck
…

17 High density variations

