Published by Erika Davis. Modified over 8 years ago.
1
1 CS 430: Information Discovery Lecture 24 Cluster Analysis
2
2 Course Administration. This week: regular class schedule. Next week: no lecture on Tuesday. Assignments: Assignment 4 (optional, extra credit) has been posted; Assignment 3 grades will be returned today. Final examination: Friday, Dec 14, 9:00 - 10:30 a.m. There will not be an early examination.
3
3 Cluster Analysis. Methods that divide a set of n objects into m non-overlapping subsets. For information discovery, cluster analysis is applied to: terms, for thesaurus construction; documents, to divide them into categories (sometimes called automatic classification, but classification usually requires a pre-determined set of categories).
4
4 Non-hierarchical and Hierarchical Methods. Non-hierarchical methods: elements are divided into m non-overlapping sets, where m is predetermined. Hierarchical methods: m is varied progressively to create a hierarchy of solutions. Agglomerative methods: m is initially equal to n, the total number of elements, so that every element is a cluster with one element. The hierarchy is produced by incrementally combining clusters.
5
5 Simple methods. Single link: at each step, join the most similar pair of elements that are not yet in the same cluster. May lead to long, straggling clusters (chaining). Complete link: calculate the similarity between two clusters as the similarity between the least similar pair of elements in the two clusters. At each step, merge the two most similar clusters. Generates small, tightly bound clusters.
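The two linkage rules above can be sketched in a few lines. This is a minimal illustration, not the lecture's own code: the function name `agglomerate`, the element names, and the one-dimensional positions are all hypothetical, and similarity is taken to be negative distance purely for the example.

```python
from itertools import combinations

def agglomerate(items, sim, linkage="single"):
    """Merge clusters step by step and return the list of merges performed.

    items   -- list of hashable element names
    sim     -- function sim(a, b) giving the similarity of two elements
    linkage -- "single" (most similar pair) or "complete" (least similar pair)
    """
    pick = max if linkage == "single" else min
    clusters = [frozenset([x]) for x in items]
    merges = []
    while len(clusters) > 1:
        def cluster_sim(pair):
            a, b = pair
            return pick(sim(x, y) for x in a for y in b)
        # Merge the two clusters that are most similar under the chosen linkage.
        a, b = max(combinations(clusters, 2), key=cluster_sim)
        merges.append((set(a), set(b), cluster_sim((a, b))))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return merges

# Hypothetical data: five elements on a line; similarity = negative distance.
pos = {"a": 0.0, "b": 1.0, "c": 2.2, "d": 3.5, "e": 5.0}
sim = lambda x, y: -abs(pos[x] - pos[y])

single = agglomerate(list(pos), sim, "single")
complete = agglomerate(list(pos), sim, "complete")

for name, merges in (("single", single), ("complete", complete)):
    print(name, [sorted(a | b) for a, b, _ in merges])
```

On this toy data single link keeps extending one cluster (a, b, then c, then d, then e), which is the chaining effect, while complete link builds {a, b} and {c, d, e} separately before joining them.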
6
6 Similarities. Clustering depends on a measure of similarity between elements. One measure of term similarity is the number of documents that contain both terms j and k:

$$S(t_j, t_k) = \sum_{i=1}^{n} t_{ij} t_{ik}$$

where t_ij = 1 if document i contains term j and 0 otherwise, and the sum runs over the n documents.
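As a small worked instance of this count, the sketch below sums the products of incidence entries over documents. The term names, the matrix values, and the helper `cooccurrence` are hypothetical and chosen only to keep the arithmetic visible.

```python
# Hypothetical incidence matrix: rows are documents, columns are terms.
terms = ["x", "y", "z"]
incidence = [
    [1, 1, 0],  # document 1 contains x and y
    [1, 0, 1],  # document 2 contains x and z
    [1, 1, 1],  # document 3 contains all three terms
]

def cooccurrence(t, j, k):
    """Number of documents whose row has a 1 in both column j and column k."""
    return sum(row[j] * row[k] for row in t)

s_xy = cooccurrence(incidence, 0, 1)  # documents 1 and 3
s_yz = cooccurrence(incidence, 1, 2)  # document 3 only
print(s_xy, s_yz)
```

With j = k the same sum gives the number of documents that contain the term at all.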
7
7 Similarity measures. Improved similarity measures can be generated by: using a term frequency matrix instead of an incidence matrix; weighting terms by frequency:

cosine measure

$$S(t_j, t_k) = \frac{\sum_{i=1}^{n} t_{ij} t_{ik}}{|t_j| \, |t_k|}$$

dice measure

$$S(t_j, t_k) = \frac{\sum_{i=1}^{n} t_{ij} t_{ik}}{\sum_{i=1}^{n} t_{ik} + \sum_{i=1}^{n} t_{ij}}$$
8
8 Incidence array
D1: alpha bravo charlie delta echo foxtrot golf
D2: golf golf golf delta alpha
D3: bravo charlie bravo echo foxtrot bravo
D4: foxtrot alpha alpha golf golf delta

    alpha  bravo  charlie  delta  echo  foxtrot  golf  total
D1  1      1      1        1      1     1        1     7
D2  1      0      0        1      0     0        1     3
D3  0      1      1        0      1     1        0     4
D4  1      0      0        1      0     1        1     4
9
9 Document similarity matrix

    D1    D2    D3    D4
D1  -     0.65  0.76  0.76
D2  0.65  -     0.00  0.87
D3  0.76  0.00  -     0.25
D4  0.76  0.87  0.25  -

Average similarity = 0.55
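The slide does not say which measure produced these values. They are consistent with the cosine measure applied to the incidence array (each document reduced to its set of distinct terms), and the sketch below recomputes them under that assumption. The variable and function names are illustrative.

```python
from itertools import combinations
from math import sqrt

docs = {
    "D1": "alpha bravo charlie delta echo foxtrot golf",
    "D2": "golf golf golf delta alpha",
    "D3": "bravo charlie bravo echo foxtrot bravo",
    "D4": "foxtrot alpha alpha golf golf delta",
}

# Incidence representation: the set of distinct terms in each document.
term_sets = {name: set(text.split()) for name, text in docs.items()}

def cosine(a, b):
    """Cosine of two binary vectors, given as the sets of their non-zero terms."""
    return len(a & b) / (sqrt(len(a)) * sqrt(len(b)))

sims = {(p, q): cosine(term_sets[p], term_sets[q])
        for p, q in combinations(docs, 2)}
average = sum(sims.values()) / len(sims)

for (p, q), s in sims.items():
    print(f"{p}-{q}: {s:.2f}")
print(f"average: {average:.2f}")
```

Replacing the sets with term counts (so that D2 weights golf three times) would give the term frequency variant mentioned on the earlier slide, with different values.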
10
10 Similarities: Incidence array
D1: alpha bravo charlie delta echo foxtrot golf
D2: golf golf golf delta alpha
D3: bravo charlie bravo echo foxtrot bravo
D4: foxtrot alpha alpha golf golf delta

    alpha  bravo  charlie  delta  echo  foxtrot  golf
D1  1      1      1        1      1     1        1
D2  1      0      0        1      0     0        1
D3  0      1      1        0      1     1        0
D4  1      0      0        1      0     1        1
n   3      2      2        3      2     3        3
11
11 Term similarity matrix

         alpha  bravo  charlie  delta  echo  foxtrot  golf
alpha    -      0.2    0.2      0.5    0.2   0.33     0.5
bravo    0.2    -      0.5      0.2    0.5   0.4      0.2
charlie  0.2    0.5    -        0.2    0.5   0.4      0.2
delta    0.5    0.2    0.2      -      0.2   0.33     0.5
echo     0.2    0.5    0.5      0.2    -     0.4      0.2
foxtrot  0.33   0.4    0.4      0.33   0.4   -        0.33
golf     0.5    0.2    0.2      0.5    0.2   0.33     -

Using incidence matrix and dice weighting
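These values can be recomputed from the four documents. The sketch below uses the dice measure in the form given in this lecture, that is, shared documents divided by the sum of the two document counts, without the factor of 2 found in the conventional Dice coefficient. The names `term_docs` and `dice_lecture` are illustrative.

```python
docs = {
    "D1": "alpha bravo charlie delta echo foxtrot golf",
    "D2": "golf golf golf delta alpha",
    "D3": "bravo charlie bravo echo foxtrot bravo",
    "D4": "foxtrot alpha alpha golf golf delta",
}

# For each term, the set of documents that contain it (one incidence column).
term_docs = {}
for name, text in docs.items():
    for term in set(text.split()):
        term_docs.setdefault(term, set()).add(name)

def dice_lecture(a, b):
    """Shared documents over the sum of document counts (no factor of 2)."""
    return len(a & b) / (len(a) + len(b))

terms = sorted(term_docs)
matrix = {(s, t): dice_lecture(term_docs[s], term_docs[t])
          for s in terms for t in terms if s != t}

for s in terms:
    row = ["-" if s == t else f"{matrix[(s, t)]:.2f}" for t in terms]
    print(f"{s:8}", "  ".join(f"{v:5}" for v in row))
```

For example, alpha occurs in three documents and bravo in two, and they share only D1, giving 1 / (3 + 2) = 0.2.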
12
12 Example -- single link. [Figure: dendrogram over the terms alpha, delta, golf, bravo, echo, charlie, foxtrot, with the joins numbered 1 to 6.] This style of diagram is called a dendrogram.
13
13 Algorithms for Cluster Analysis. Algorithms for cluster analysis are the subject of the reading for the final discussion class.
14
14 Example 1: Cluster Analysis of Social Science Journals. In the social sciences, subject boundaries are unclear. Can citation patterns be used to develop criteria for matching information services to the interests of users? W. Y. Arms and C. R. Arms, Cluster analysis used on social science citations, Journal of Documentation, 34 (1), pp. 1-11, March 1978.
15
15 Methodology Assumption: Two journals are close to each other if they are cited by the same source journals, with similar relative frequencies. Sources of citations: Select a sample of n social science journals. Citation matrix: Construct an m x n matrix in which the ijth element is the number of citations to journal i from journal j. Normalization: All data was normalized so that the sum of the elements in each row is 1.
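The normalization step can be illustrated as follows. This is only a sketch: the citation counts are invented for the example, and `normalize_rows` is a hypothetical helper, not code from the study.

```python
def normalize_rows(matrix):
    """Scale each row so its elements sum to 1; all-zero rows are left unchanged."""
    result = []
    for row in matrix:
        total = sum(row)
        result.append([x / total for x in row] if total else list(row))
    return result

# Hypothetical counts: row i is a cited journal, column j is a source journal,
# and each entry is the number of citations to journal i from journal j.
citations = [
    [8, 2, 0],
    [1, 1, 2],
    [0, 5, 5],
]

profiles = normalize_rows(citations)
for row in profiles:
    print(row)
```

After this step each row describes the relative frequency with which a journal is cited by each source, so a heavily cited journal and a lightly cited one can be compared on the same scale.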
16
16 Data. Pilot study: 5,000 citations from the 1970 volumes of 17 major journals from across the social sciences. Criminology citations: every fifth citation from a set of criminology journals (3 sets of data for 1950, 1960, 1970). Main file (52,000 citations): (a) every citation from the 1970 volumes of the 48 most cited source journals in the pilot study; (b) every citation from the 1970 volumes of 47 randomly selected journals.
17
17 Sample sizes

Sample               Source journals  Target journals
Pilot                17               115
Criminology: 1950    10               18
Criminology: 1960    13               49
Criminology: 1970    27               108
Main file: ranked    48               495
Main file: random    47               254

Excludes journals that are cited by only one source. These were assumed to cluster with the source.
18
18 Algorithms. The main analysis used a non-hierarchical method of E. M. L. Beale and M. G. Kendall based on Euclidean distance. For comparison, 36 psychology journals were clustered using: single-linkage; complete-linkage; van Rijsbergen's algorithm. The Beale/Kendall algorithm and complete-linkage produced similar results. Single-linkage suffered from chaining. Van Rijsbergen's algorithm seeks very clear-cut clusters, which were not found in the data.
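The slides do not describe the Beale/Kendall procedure itself, so it is not reproduced here. To give a feel for a non-hierarchical method based on Euclidean distance, the sketch below uses plain k-means as a stand-in technique; it should not be read as the algorithm used in the study. The citation profiles, the choice of k, and the naive initialization are all assumptions made for the example.

```python
from math import dist  # Euclidean distance, Python 3.8+

def kmeans(points, k, iterations=100):
    """Plain k-means (Lloyd's algorithm), first k points as initial centres."""
    centres = [list(p) for p in points[:k]]
    labels = None
    for _ in range(iterations):
        # Assign every point to its nearest centre.
        new_labels = [min(range(k), key=lambda c: dist(p, centres[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments are stable, so stop
        labels = new_labels
        # Move each centre to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old centre if a cluster empties
                centres[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centres

# Hypothetical row-normalized citation profiles for four journals.
points = [
    [0.8, 0.2, 0.0],
    [0.1, 0.1, 0.8],
    [0.7, 0.3, 0.0],
    [0.0, 0.2, 0.8],
]

labels, centres = kmeans(points, k=2)
print(labels)
```

Unlike the hierarchical methods, the number of clusters is fixed in advance and the result is a single partition, with no dendrogram.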
19
19 Non-hierarchical clusters Economics clusters in the pilot study
20
20 Non-hierarchical dendrogram Part of a dendrogram showing non-hierarchical structure
21
21 Conclusion. "The overall conclusion must be that cluster analysis is not a practical method of designing secondary services in the social sciences." Because of skewed distributions, very large amounts of data are required. Results are complex and difficult to interpret. Overlap between social sciences leads to results that are sensitive to the precise data and algorithms chosen.