The 5th annual UK Workshop on Computational Intelligence, London, 5-7 September 2005
Department of Electronic & Electrical Engineering, University College London, UK
Learning Topic Hierarchies from Text Documents using a Scalable Hierarchical Fuzzy Clustering Method
E. Mendes Rodrigues and L. Sacks {mmendes,
Outline
- Document clustering process
- H-FCM: Hyper-spherical Fuzzy C-Means
- H2-FCM: Hierarchical H-FCM
- Clustering experiments
- Topic hierarchies
Document Clustering Process
[Diagram: Document Collection; Document Representation (Pre-processing, Document Encoding); Document Clustering (Document Similarity, Clustering Method); Document Clusters; Cluster Validity; Application]
Document representation:
1. Identify all unique words in the document collection
2. Discard common words that are included in the stop list
3. Apply stemming algorithm and combine identical word stems
4. Discard terms using pre-processing filters
5. Apply term weighting scheme to the final set of k indexing terms
Document vectors:
\[ X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1k} \\ x_{21} & x_{22} & \cdots & x_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ x_{N1} & x_{N2} & \cdots & x_{Nk} \end{bmatrix} \]
Vector-Space Model of Information Retrieval
- Very high-dimensional
- Very sparse (over 95%)
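
The representation steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop list, the `crude_stem` helper and the tf-idf weighting are placeholders chosen for the example (the slides do not name a specific stemmer or weighting scheme), and the term-filtering step is omitted.

```python
import math
from collections import Counter

# Hypothetical, tiny stop list; a real system would use a much longer one.
STOP_WORDS = {"the", "a", "of", "and", "in", "to", "is"}

def crude_stem(word):
    # Placeholder stemmer (assumption): strips a trailing "s" from longer words.
    return word[:-1] if len(word) > 3 and word.endswith("s") else word

def encode(documents):
    """Return (terms, X) where X[i][j] is the tf-idf weight of term j in document i."""
    tokenised = [
        [crude_stem(w) for w in doc.lower().split() if w not in STOP_WORDS]
        for doc in documents
    ]
    terms = sorted({t for doc in tokenised for t in doc})   # the k indexing terms
    n_docs = len(documents)
    doc_freq = Counter(t for doc in tokenised for t in set(doc))
    X = []
    for doc in tokenised:
        tf = Counter(doc)
        # tf-idf is an assumed weighting scheme, used here only as an example.
        X.append([tf[t] * math.log(n_docs / doc_freq[t]) for t in terms])
    return terms, X

docs = [
    "fuzzy clusters of documents",
    "the documents in clusters",
    "stock markets and trade",
]
terms, X = encode(docs)
```

Even in this toy collection more than half of the entries of `X` are zero; with thousands of indexing terms the matrix becomes far sparser.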
Measures of Document Relationship
- FCM applies the Euclidean distance, which is inappropriate for high-dimensional text clustering
  - the non-occurrence of the same terms in both documents is handled in a similar way to the co-occurrence of terms
- Cosine (dis)similarity measure
  - widely applied in Information Retrieval
  - represents the cosine of the angle between two document vectors
  - insensitive to different document lengths, since it is normalised by the length of the document vectors
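
A small sketch of the contrast drawn above, using the usual definition of the cosine measure (dot product divided by the product of the vector lengths). The three document vectors are made up for illustration.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0
    return dot / (norm_x * norm_y)

def euclidean_distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

short_doc = [1.0, 2.0, 0.0, 0.0]
long_doc = [10.0, 20.0, 0.0, 0.0]   # same term proportions, ten times longer
other_doc = [0.0, 0.0, 1.0, 0.0]    # shares no terms with short_doc

sim_same_topic = cosine_similarity(short_doc, long_doc)
sim_unrelated = cosine_similarity(short_doc, other_doc)
dist_same_topic = euclidean_distance(short_doc, long_doc)
dist_unrelated = euclidean_distance(short_doc, other_doc)
```

Here the cosine measure rates the short and long documents as maximally similar, while the Euclidean distance places the short document closer to the unrelated one, simply because both are short.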
H-FCM: Hyper-spherical Fuzzy C-Means
- Applies the cosine measure to assess document relationships
- Modified objective function:
- Subject to an additional constraint:
- Fuzzy memberships (u) and cluster centroids (v):
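
One way to read the slide above is the following sketch. It assumes the dissimilarity D = 1 - cosine replaces the squared Euclidean distance in the FCM objective, that the additional constraint keeps every centroid at unit length, that memberships follow the usual FCM update with D, and that each centroid is the normalised membership-weighted sum of the documents. The farthest-first initialisation and the fixed iteration count are choices made for this example only.

```python
import math

def normalise(v):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v] if norm > 0 else list(v)

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def farthest_first(X, c):
    """Assumed initialisation: start from document 0, then repeatedly add the
    document that is least similar to the centroids chosen so far."""
    chosen = [0]
    while len(chosen) < c:
        candidates = [i for i in range(len(X)) if i not in chosen]
        chosen.append(min(candidates, key=lambda i: max(dot(X[i], X[j]) for j in chosen)))
    return [list(X[i]) for i in chosen]

def hfcm(X, c, m=2.0, iterations=30, eps=1e-12):
    """Return (U, V): U[a][i] is the membership of document i in cluster a,
    V[a] is the unit-length centroid of cluster a."""
    X = [normalise(x) for x in X]
    n, k = len(X), len(X[0])
    V = farthest_first(X, c)
    exponent = 1.0 / (m - 1.0)
    U = []
    for _ in range(iterations):
        # Cosine dissimilarity between every centroid and every document.
        D = [[max(1.0 - dot(x, v), eps) for x in X] for v in V]
        # FCM-style membership update, with D in place of the squared distance.
        U = [[1.0 / sum((D[a][i] / D[b][i]) ** exponent for b in range(c))
              for i in range(n)] for a in range(c)]
        # Centroid update: weighted sum of documents, rescaled to unit length.
        V = [normalise([sum(U[a][i] ** m * X[i][j] for i in range(n)) for j in range(k)])
             for a in range(c)]
    return U, V

demo_docs = [[1, 1, 0, 0], [2, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 3]]
U, V = hfcm(demo_docs, c=2)
```

A production version would stop on a convergence test rather than after a fixed number of iterations, and would use sparse vectors.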
How many clusters?
- Usually the final number of clusters is not known a priori
  - Run the algorithm for a range of c values
  - Apply validity measures and determine which c leads to the best partition (cluster compactness, density, separation, etc.)
- How compact and dense are clusters in a sparse high-dimensional problem space?
  - Only a very small percentage of the documents within a cluster are highly similar to the respective centroid, so clusters are not compact
  - However, there is always a clear separation between the intra- and inter-cluster similarity distributions
H2-FCM: Hierarchical Hyper-spherical Fuzzy C-Means
Key concepts
- Apply a partitional algorithm (H-FCM) to obtain a sufficiently large number of clusters
- Exploit the granularity of the topics associated with each cluster to link cluster centroids hierarchically
  - Form a topic hierarchy
- Asymmetric similarity measure
  - Identify parent-child type relationships between cluster centroids
  - A child should be less similar to its parent than the parent is to the child
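
To make the idea of an asymmetric measure concrete, here is a minimal sketch. The measure used, sum of element-wise minima divided by the sum of the first vector's weights, is an assumption for illustration; the slide does not state the formula. The `child_of_pair` helper is hypothetical and simply applies the rule from the last bullet.

```python
def asymmetric_similarity(a, b):
    """Assumed measure: sum_j min(a_j, b_j) / sum_j a_j, for non-negative weights.
    In general asymmetric_similarity(a, b) != asymmetric_similarity(b, a)."""
    total = sum(a)
    if total == 0:
        return 0.0
    return sum(min(p, q) for p, q in zip(a, b)) / total

def child_of_pair(a, b):
    """Return "a", "b" or None, following the rule that the child is less
    similar to the parent than the parent is to the child."""
    s_ab = asymmetric_similarity(a, b)
    s_ba = asymmetric_similarity(b, a)
    if s_ab < s_ba:
        return "a"
    if s_ba < s_ab:
        return "b"
    return None

v1 = [1.0, 1.0, 1.0, 1.0]
v2 = [1.0, 1.0, 0.0, 0.0]
s_12 = asymmetric_similarity(v1, v2)
s_21 = asymmetric_similarity(v2, v1)
```

Which centroid ends up as the parent depends entirely on the measure actually used, so the direction produced by this toy measure should not be read as the method's behaviour.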
The H2-FCM Algorithm
Flowchart:
1. Apply H-FCM (c, m)
2. All clusters have size ≥ t_ND? If no, set c = c - K and return to step 1
3. Compute the asymmetric similarity S(v_α, v_β) for all pairs of centroids
4. While V_F ≠ ∅:
   - Select centroid: v_α ∈ V_F such that S(v_α, v_β) = max[S(v_α, v_β)], v_α, v_β ∈ V_F
   - V_H = ∅? If yes, add root
   - If no, select parent
   - S ≥ t_PCS? If yes, add child; if no, add root
[Figure: documents and cluster centroids for clusters C1 to C12, with centroids v1 to v12 moving from V_F to V_H; example comparisons: S(v1,v5) ≥ t_PCS, S(v8,v5) < t_PCS, S(v8,v1) < t_PCS]
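
The linking loop of the flowchart can be sketched as follows. This is a simplified reading, not the authors' implementation: centroids are taken from V_F in index order (the slide's selection rule is not reproduced), the parent is assumed to be the centroid in V_H with the highest similarity, and `link_centroids` and its arguments are hypothetical names.

```python
def link_centroids(S, t_pcs):
    """S[a][b] is the asymmetric similarity of centroid a to centroid b.
    Returns a dict mapping each centroid index to its parent index,
    or to None when the centroid becomes a root."""
    free = list(range(len(S)))   # V_F: centroids not yet in the hierarchy
    placed = []                  # V_H: centroids already in the hierarchy
    parent = {}
    while free:
        a = free.pop(0)          # assumed selection order: lowest index first
        if not placed:
            parent[a] = None     # first centroid becomes a root
        else:
            best = max(placed, key=lambda b: S[a][b])
            # Attach as a child only if the similarity reaches the threshold.
            parent[a] = best if S[a][best] >= t_pcs else None
        placed.append(a)
    return parent

# Toy similarity matrix for three centroids (made-up values).
S = [
    [1.0, 0.4, 0.1],
    [0.6, 1.0, 0.2],
    [0.2, 0.3, 1.0],
]
hierarchy = link_centroids(S, t_pcs=0.5)
```

With these values centroid 1 is linked under centroid 0, while centroid 2 falls below the threshold for every placed centroid and starts a new root. Each centroid is compared with at most c others, so the loop costs O(c^2) lookups once S is available.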
Scalability of the Algorithm
- H2-FCM time complexity depends on H-FCM and on the centroid linking heuristic
- H-FCM computation time is O(N c^2 k)
- Linking heuristic is at most O(c^2 k)
  - Computation of the asymmetric similarity between every pair of cluster centroids: O(c^2 k)
  - Generation of the cluster hierarchy: O(c^2)
- Overall, H2-FCM time complexity is O(N c^2 k)
- Scales well to large document sets!
Description of Experiments
- Goal: evaluate the H2-FCM performance
- Evaluation measures: clustering Precision (P) and Recall (R)
- H2-FCM algorithm run for a range of c values
  - No. hierarchy roots = No. reference classes
  - t_PCS dynamically set
- Are sub-clusters of the same topic assigned to the same branch?

                          In reference class     Not in reference class
Assigned to cluster       true positives (tp)    false positives (fp)
Not assigned to cluster   false negatives (fn)   true negatives (tn)
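
Using the counts in the table above, precision and recall follow the standard definitions P = tp / (tp + fp) and R = tp / (tp + fn). A minimal sketch, with made-up document identifiers and a hypothetical helper name:

```python
def precision_recall(assigned, reference):
    """assigned: set of document ids placed in the cluster (or branch);
    reference: set of document ids belonging to the reference class."""
    tp = len(assigned & reference)    # assigned and in the reference class
    fp = len(assigned - reference)    # assigned but not in the reference class
    fn = len(reference - assigned)    # in the reference class but not assigned
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Three of the four assigned documents are correct; two reference documents are missed.
precision, recall = precision_recall({1, 2, 3, 4}, {2, 3, 4, 5, 6})
```

For this example the precision is 3/4 and the recall is 3/5.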
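Test Document Collections
- Sources: Reuters test collection, Open Directory Project (ODP), INSPEC database
- Table columns: Collection; Size (N, k); Classes (no., labels); Document length (avg, stdev); Document sparsity (avg, stdev)
- reuters: classes acq, earn, trade; document sparsity stdev 0.26 %
- reuters: classes crude, interest, money-fx, ship, trade; document sparsity stdev 0.47 %
- odp: classes game, lego, math, safety, sport; document sparsity stdev 0.50 %
- inspec: classes back-propagation, fuzzy control, pattern clustering; document sparsity stdev 0.14 %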
Clustering Results: H2-FCM Precision and Recall
[Figure: four panels, one per collection: reuters1, reuters2, odp, inspec]
Topic Hierarchy
- Each centroid vector consists of a set of weighted terms
  - Terms describe the topics associated with the document cluster
- Centroid hierarchy produces a topic hierarchy
  - Useful for efficient access to individual documents
  - Provides context to users in exploratory information access
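
Since each centroid is a vector of term weights, a topic label can be read off by taking its highest-weighted terms. A minimal sketch, with a hypothetical `top_terms` helper and made-up terms and weights:

```python
def top_terms(terms, centroid, n=3):
    """Return the n terms with the largest weights in the centroid vector."""
    ranked = sorted(range(len(terms)), key=lambda j: centroid[j], reverse=True)
    return [terms[j] for j in ranked[:n]]

terms = ["bank", "interest", "rate", "ship", "trade"]
centroid = [0.1, 0.7, 0.6, 0.0, 0.2]
label = top_terms(terms, centroid)
```

Applying the same helper to every node of the centroid hierarchy gives one possible way to display it as a topic hierarchy.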
Topic Hierarchy Example
Concluding Remarks
- H2-FCM clustering algorithm
  - Partitional clustering (H-FCM)
  - Linking heuristic organizes centroids hierarchically, based on asymmetric similarity
  - Scales linearly with the number of documents
  - Exhibits good clustering performance
- Topic hierarchy can be extracted from the centroid hierarchy
Clustering in Sparse High-dimensional Spaces
[Figure: intra- and inter-cluster similarity CDFs for reuters1 and reuters2]
Clustering in Sparse High-dimensional Spaces (contd.)
[Figure: intra- and inter-cluster similarity CDFs for odp and inspec]
FCM: Fuzzy C-Means
- Iterative optimization of an objective function:
- Subject to constraints:
- Fuzzy memberships (u) and cluster centroids (v):
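
For reference, the standard Fuzzy C-Means formulation can be written as below. The notation (documents x_i, centroids v_α, memberships u_{αi}, fuzzifier m) is an assumption chosen to match the symbols used elsewhere in these slides, not a transcription of the slide itself.

```latex
% Objective function, minimised iteratively (m > 1 is the fuzzification parameter)
J_m(U, V) = \sum_{i=1}^{N} \sum_{\alpha=1}^{c} u_{\alpha i}^{m} \, \lVert x_i - v_\alpha \rVert^{2}

% Constraints on the fuzzy memberships
u_{\alpha i} \in [0, 1], \qquad
\sum_{\alpha=1}^{c} u_{\alpha i} = 1 \;\; \forall i, \qquad
0 < \sum_{i=1}^{N} u_{\alpha i} < N \;\; \forall \alpha

% Update equations for memberships and centroids
u_{\alpha i} = \left[ \sum_{\beta=1}^{c}
  \left( \frac{\lVert x_i - v_\alpha \rVert^{2}}{\lVert x_i - v_\beta \rVert^{2}} \right)^{\frac{1}{m-1}} \right]^{-1},
\qquad
v_\alpha = \frac{\sum_{i=1}^{N} u_{\alpha i}^{m} \, x_i}{\sum_{i=1}^{N} u_{\alpha i}^{m}}
```

The two update equations are alternated until the memberships or the objective stop changing appreciably.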