Distributional Clustering of Words for Text Classification
L. Douglas Baker, Andrew Kachites McCallum
SIGIR’98
Distributional Clustering
- Word similarity based on class label distribution
- ‘puck’ and ‘goalie’
- ‘team’
Distributional Clustering
- Clustering words based on class distribution (supervised)
- Similarity between w_t and w_s = similarity between P(C|w_t) and P(C|w_s)
- Information-theoretic measure to calculate similarity between distributions
- Kullback-Leibler divergence to the mean
Distributional Clustering Class 8: Autos and Class 9: Motorcycles
Kullback-Leibler Divergence
- D is asymmetric
- D is infinite when P(y) = 0 and P(x) ≠ 0
- D ≥ 0
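For reference, the standard definition of the KL divergence is sketched below in the P(C|w) notation used earlier in the deck. The class index j over c_j is a notational assumption; the slide's P(x) and P(y) presumably play the roles of the first and second distribution.

```latex
D\bigl(P(C \mid w_t) \,\|\, P(C \mid w_s)\bigr)
  = \sum_{j} P(c_j \mid w_t)\,
    \log \frac{P(c_j \mid w_t)}{P(c_j \mid w_s)}
```

Swapping the two arguments generally changes the value, which is the asymmetry noted above. A term with zero in the denominator and a nonzero numerator makes the sum infinite.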
Kullback-Leibler Divergence
- Jensen-Shannon divergence is the special case of the symmetrised KL divergence (KL divergence to the mean) in which P(w_t) = P(w_s) = 0.5
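A minimal Python sketch of the KL divergence to the mean follows. It assumes the weighted form P(w_t)·D(P(C|w_t) ‖ M) + P(w_s)·D(P(C|w_s) ‖ M), where M is the prior-weighted average of the two class distributions. That exact weighting, the function names, and the toy numbers are assumptions for illustration, not taken from the slide.

```python
import math


def kl_divergence(p, q):
    """D(p || q) over aligned class probabilities; 0 * log(0 / q) counts as 0."""
    total = 0.0
    for pj, qj in zip(p, q):
        if pj == 0.0:
            continue
        if qj == 0.0:
            return math.inf  # q assigns zero mass where p does not
        total += pj * math.log(pj / qj)
    return total


def kl_to_mean(dist_t, dist_s, prior_t, prior_s):
    """Symmetric, always-finite distance between two class distributions.

    dist_t, dist_s: P(C|w_t) and P(C|w_s) as lists over the same classes.
    prior_t, prior_s: P(w_t) and P(w_s) (assumed weighting, see lead-in).
    """
    total = prior_t + prior_s
    mean = [(prior_t * a + prior_s * b) / total for a, b in zip(dist_t, dist_s)]
    return (prior_t * kl_divergence(dist_t, mean)
            + prior_s * kl_divergence(dist_s, mean))


# Hypothetical class distributions over two classes (hockey, baseball).
p_puck = [0.9, 0.1]
p_goalie = [0.8, 0.2]
p_team = [0.5, 0.5]

near = kl_to_mean(p_puck, p_goalie, 0.5, 0.5)
far = kl_to_mean(p_puck, p_team, 0.5, 0.5)
```

With both priors set to 0.5 this reduces to the Jensen-Shannon divergence. Because the mean is nonzero wherever either input is nonzero, the result stays finite even when plain KL divergence would be infinite.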
Clustering Algorithm
Characteristics:
- Greedy, aggressive
- Locally optimal
- Hard clustering
- Agglomerative
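The sketch below shows one greedy, hard, agglomerative scheme consistent with the characteristics listed above. The specific loop (add one word at a time, then merge the closest pair whenever the cluster count exceeds a limit), the input ordering, the distance weighting, and all names and toy numbers are assumptions for illustration; the slide does not spell out the procedure.

```python
import math
from itertools import combinations


def kl_divergence(p, q):
    """D(p || q) over aligned class probabilities; 0 * log(0 / q) counts as 0."""
    total = 0.0
    for pj, qj in zip(p, q):
        if pj == 0.0:
            continue
        if qj == 0.0:
            return math.inf
        total += pj * math.log(pj / qj)
    return total


def merge(a, b):
    """Hard merge: pool the words and average the distributions by prior."""
    prior = a["prior"] + b["prior"]
    dist = [(a["prior"] * x + b["prior"] * y) / prior
            for x, y in zip(a["dist"], b["dist"])]
    return {"words": a["words"] + b["words"], "prior": prior, "dist": dist}


def distance(a, b):
    """KL divergence to the mean between two clusters (assumed weighting)."""
    mean = merge(a, b)["dist"]
    return (a["prior"] * kl_divergence(a["dist"], mean)
            + b["prior"] * kl_divergence(b["dist"], mean))


def cluster_words(words, num_clusters):
    """words: iterable of (word, P(w), P(C|w)) tuples, processed in the given order."""
    clusters = []
    for word, prior, dist in words:
        # Each new word starts as its own singleton cluster.
        clusters.append({"words": [word], "prior": prior, "dist": list(dist)})
        if len(clusters) > num_clusters:
            # Greedy step: merge the closest pair; the merge is never undone.
            i, j = min(combinations(range(len(clusters)), 2),
                       key=lambda ij: distance(clusters[ij[0]], clusters[ij[1]]))
            merged = merge(clusters[i], clusters[j])
            clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
            clusters.append(merged)
    return clusters


# Hypothetical words with class distributions over (hockey, autos).
toy_words = [
    ("puck", 0.25, [0.95, 0.05]),
    ("goalie", 0.25, [0.90, 0.10]),
    ("engine", 0.25, [0.05, 0.95]),
    ("car", 0.25, [0.10, 0.90]),
]
clusters = cluster_words(toy_words, num_clusters=2)
```

Each word ends up in exactly one cluster (hard clustering), and every merge is only the best choice available at that step, so the final grouping is locally rather than globally optimal.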
Experiments
Datasets:
- 20 Newsgroups
- Reuters
- Yahoo Science Hierarchy
Compared with:
- Supervised Latent Semantic Indexing
- Class-based clustering
- Feature selection by mutual information with the class variable
- Feature selection by Markov-blanket method
Classifier: naive Bayes classifier (NBC)
Results
Conclusion
- Useful semantic word clusterings
- Higher classification accuracy
- Smaller classification models
Word clustering vs. feature selection? What if the data is:
- Noisy?
- Sparse?