Document Clustering
Carl Staelin
Lecture 7, Information Retrieval and Digital Libraries

Motivation
- It is hard to rapidly understand a large collection of documents
- Humans look for patterns, and are good at pattern matching
- "Random" collections of documents have no recognizable structure
- Clustering documents into recognizable groups makes patterns easier to see
- Irrelevant clusters can then be eliminated quickly
Basic Idea
- Choose a document similarity measure
- Choose a cluster cost or similarity criterion
- Group like documents into clusters with minimal cluster cost
Cluster Cost Criteria
- Sum-of-squared-error:
  Cost = Σ_i ||x_i − x̄||², where x̄ is the cluster mean
- Average squared distance:
  Cost = (1/n²) Σ_i Σ_j ||x_i − x_j||²
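The two cost criteria above can be sketched in a few lines of NumPy (an implementation choice, not part of the slides; `X` is assumed to be an n×d array of document vectors in one cluster):

```python
import numpy as np

def sum_squared_error(X):
    """Sum of squared distances from each point to the cluster mean x-bar."""
    center = X.mean(axis=0)
    return float(((X - center) ** 2).sum())

def avg_squared_distance(X):
    """Average squared pairwise distance: (1/n^2) * sum_ij ||x_i - x_j||^2."""
    n = len(X)
    diffs = X[:, None, :] - X[None, :, :]  # all pairwise difference vectors
    return float((diffs ** 2).sum() / n ** 2)
```

The two are closely related: the average squared distance equals 2/n times the sum-of-squared-error, so minimizing one minimizes the other for a fixed cluster size.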
Cluster Similarity Measures
Measure the similarity of two clusters C_i and C_j:
1. d_min(C_i, C_j) = min_{x ∈ C_i, y ∈ C_j} ||x − y||
2. d_max(C_i, C_j) = max_{x ∈ C_i, y ∈ C_j} ||x − y||
3. d_avg(C_i, C_j) = (1/(n_i n_j)) Σ_{x ∈ C_i} Σ_{y ∈ C_j} ||x − y||
4. d_mean(C_i, C_j) = ||(1/n_i) Σ_{x ∈ C_i} x − (1/n_j) Σ_{y ∈ C_j} y||
5. …
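A minimal NumPy sketch of the four measures, assuming clusters `A` and `B` are arrays of row vectors (the function names mirror the slide's notation; the pairwise-distance matrix computation is an implementation detail):

```python
import numpy as np

def _pairwise(A, B):
    """Matrix of distances ||x - y|| for all x in A, y in B."""
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

def d_min(A, B):
    """Single-link: distance between the closest pair of points."""
    return float(_pairwise(A, B).min())

def d_max(A, B):
    """Complete-link: distance between the farthest pair of points."""
    return float(_pairwise(A, B).max())

def d_avg(A, B):
    """Average-link: mean distance over all cross-cluster pairs."""
    return float(_pairwise(A, B).mean())

def d_mean(A, B):
    """Centroid distance: distance between the two cluster means."""
    return float(np.linalg.norm(A.mean(axis=0) - B.mean(axis=0)))
```

Which measure to use matters: d_min tends to produce long "chained" clusters, while d_max and d_mean favor compact ones.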
Iterative Clustering
- Assign points to initial k clusters
  - Often this is done by random assignment
- Until done:
  - Select a candidate point x in cluster c
  - Find the "best" cluster c′ for x
  - If c ≠ c′, move x to c′
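A minimal sketch of the loop above, assuming NumPy, Euclidean distance, and "best cluster" defined as nearest centroid (a k-means-style choice; the slides leave the criterion open). For simplicity it reassigns all points per pass rather than one candidate at a time:

```python
import numpy as np

def iterative_cluster(X, k, iters=100, seed=0):
    """Iterative clustering sketch: random initial assignment, then move
    each point to the cluster with the nearest centroid until no change."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=len(X))  # random initial assignment
    for _ in range(iters):
        # Recompute centroids; reseed any empty cluster from a random point
        centers = np.array([
            X[labels == c].mean(axis=0) if (labels == c).any()
            else X[rng.integers(len(X))]
            for c in range(k)
        ])
        # Each point's "best" cluster is the one with the nearest centroid
        new = np.linalg.norm(X[:, None, :] - centers[None, :, :],
                             axis=2).argmin(axis=1)
        if (new == labels).all():
            break  # done: no point wants to move
        labels = new
    return labels
```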
Iterative Clustering
- The user must pre-select the number of clusters k
  - Often the "correct" number is not known in advance!
- The quality of the outcome usually depends on the quality of the initial assignment
  - Possibly use some other algorithm to create a good initial assignment?
Hierarchical Agglomerative Clustering
- Create N single-document clusters
- Repeat N − 1 times:
  - Merge the two clusters with greatest similarity
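A naive sketch of the merge loop, assuming NumPy and single-link (d_min) similarity, stopped once k clusters remain rather than run to a single cluster (an O(N³) illustration, not an efficient implementation):

```python
import numpy as np

def hac(X, k):
    """Hierarchical agglomerative clustering sketch with single-link (d_min).
    Starts with one cluster per point and merges the closest pair until
    only k clusters remain."""
    clusters = [[i] for i in range(len(X))]  # N single-document clusters
    while len(clusters) > k:
        best = None  # (distance, index a, index b) of the closest pair
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)  # merge the two closest clusters
    return clusters
```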
Hierarchical Agglomerative Clustering
- Hierarchical agglomerative clustering gives a hierarchy of clusters
- This makes it easier to explore the set of possible k values and choose the best number of clusters
[Figure: the same hierarchy cut at k = 3, 4, and 5 clusters]
High Density Variations
[Figure: an intuitively "correct" clustering of data with varying density, alongside the HAC-generated clusters for the same data]
Hybrid
- Combine HAC and iterative clustering
- Assign points to initial clusters using HAC
- Until done:
  - Select a candidate point x in cluster c
  - Find the "best" cluster c′ for x
  - If c ≠ c′, move x to c′
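The hybrid scheme can be sketched by chaining the two phases: single-link HAC builds the initial assignment, then nearest-centroid reassignment refines it (NumPy, Euclidean distance, and the nearest-centroid "best cluster" criterion are assumptions for illustration):

```python
import numpy as np

def hybrid_cluster(X, k, iters=50):
    """Hybrid sketch: HAC (single-link) for the initial assignment,
    then iterative nearest-centroid reassignment to refine it."""
    # --- HAC phase: merge the closest pair of clusters until k remain ---
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)
    labels = np.empty(len(X), dtype=int)
    for c, members in enumerate(clusters):
        labels[members] = c
    # --- iterative phase: move points to their nearest-centroid cluster ---
    # (assumes no cluster empties out, which holds for well-separated data)
    for _ in range(iters):
        centers = np.array([X[labels == c].mean(axis=0) for c in range(k)])
        new = np.linalg.norm(X[:, None, :] - centers[None, :, :],
                             axis=2).argmin(axis=1)
        if (new == labels).all():
            break
        labels = new
    return labels
```

The HAC phase plays the role of the "good initial assignment" suggested on the Iterative Clustering slide, avoiding the sensitivity to random initialization.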
Other Algorithms
- Support Vector Clustering
- Information Bottleneck
- …