Published by Sarah Flynn. Modified over 9 years ago.
1
Intelligent Database Systems Lab, 國立雲林科技大學 National Yunlin University of Science and Technology
A Comprehensive Comparison Study of Document Clustering for a Biomedical Digital Library MEDLINE
Advisor: Dr. Hsu
Graduate: Jing Wei Lin
Authors: Illhoi Yoo and Xiaohua Hu
2006.JCDL.10
2
Outline
─ Motivation
─ Objective
─ Use MeSH Ontology on Vector Space Model
─ Classification of Document Clustering Approaches
─ Experiment Results
─ Conclusion
3
Motivation
Document clustering has been used to improve document retrieval, document browsing, and text mining in digital libraries.
4
Objective
The goal of this paper is to perform a comprehensive comparison study of various document clustering approaches in terms of efficiency, effectiveness, and scalability.
5
Use MeSH Ontology on Vector Space Model
Problem:
─ Each document d is represented as a high-dimensional vector of word/term frequencies.
Solution:
─ Use the MeSH ontology to map the terms in a document onto MeSH descriptors. For example, the descriptor “Neoplasms” has the entry terms {“Cancer”, “Cancers”, “Neoplasm”, “Tumors”}.
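The idea of collapsing entry terms onto a single descriptor can be sketched as follows. This is only an illustration, not the authors' implementation: the lookup table `ENTRY_TO_DESCRIPTOR` and the function `descriptor_vector` are hypothetical names, only the “Neoplasms” entry comes from the slide, and the sketch handles single-word terms only.

```python
from collections import Counter

# Hypothetical lookup table: lowercased entry term -> MeSH descriptor.
# Only the "Neoplasms" entry is taken from the slide; a real system
# would load the full MeSH vocabulary instead.
ENTRY_TO_DESCRIPTOR = {
    "neoplasms": "Neoplasms",
    "cancer": "Neoplasms",
    "cancers": "Neoplasms",
    "neoplasm": "Neoplasms",
    "tumors": "Neoplasms",
}


def descriptor_vector(text):
    """Count descriptor occurrences in a document (single-word terms only)."""
    tokens = text.lower().split()
    return Counter(
        ENTRY_TO_DESCRIPTOR[token] for token in tokens if token in ENTRY_TO_DESCRIPTOR
    )


doc = "Cancer cells and tumors differ from one neoplasm to another"
print(descriptor_vector(doc))
```

Three different surface words end up in one dimension, so the vector has fewer dimensions than a plain word-frequency vector would.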
6
Classification of Document Clustering Approaches
Hierarchical clustering algorithms
─ Advantage: they generate a document hierarchy, so users can drill up and down to specific topics of interest.
─ Disadvantage: because of their cubic time complexity, they scale poorly to very large document collections.
Partitional clustering algorithms
─ Advantage: their time complexity is lower; for example, K-means is O(k*T*n).
─ Disadvantage: clustering results are highly sensitive to the initial centroids, because the centroids are selected at random.
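The loop structure behind the O(k*T*n) cost, and the random initialization that makes results depend on the starting centroids, can be seen in a minimal K-means sketch. It assumes k is the number of clusters, T the number of iterations, and n the number of documents; the function name `kmeans` and its parameters are illustrative, and documents are plain tuples of numbers rather than real term vectors.

```python
import random


def kmeans(points, k, iterations=10, seed=0):
    """Basic K-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):  # T iterations
        clusters = [[] for _ in range(k)]
        for p in points:  # n documents
            # k distance computations per document
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, clusters


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, k=2)
print(sorted(sorted(c) for c in clusters))
```

Changing `seed` changes the starting centroids; on less cleanly separated data than this toy example, that can change the final clusters.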
7
[Figure: diagram with repeated node labels w, CR, CL; details not recoverable from the transcript]
8
Experiment Results
How much does the method that Bisecting K-means uses to select the cluster to bisect affect clustering results?
[Figure: values shown are 0.44, 0.27, 0.06, 0.21; labels not recoverable from the transcript]
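To make the role of the selection method concrete, here is a minimal Bisecting K-means sketch in which the rule for choosing the next cluster to split is a pluggable function. The slides do not define the selection methods that were compared, so the default used here (split the largest cluster) is an assumption for illustration only, and the names `two_means`, `bisecting_kmeans`, and `select` are hypothetical.

```python
import random


def two_means(points, iterations=10, seed=0):
    """Split a list of numeric tuples into two clusters with a basic 2-means loop."""
    rng = random.Random(seed)
    centroids = rng.sample(points, 2)
    halves = [[], []]
    for _ in range(iterations):
        halves = [[], []]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            halves[0 if d[0] <= d[1] else 1].append(p)
        for i, members in enumerate(halves):
            if members:
                centroids[i] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return halves


def bisecting_kmeans(points, k, select=len):
    """Repeatedly bisect one cluster until k clusters exist.

    `select` scores each candidate cluster; the highest score is split next.
    The default (cluster size) is an assumed rule, not one taken from the slides.
    """
    clusters = [list(points)]
    while len(clusters) < k:
        candidates = [c for c in clusters if len(c) > 1]
        if not candidates:
            break  # nothing left that can be split
        target = max(candidates, key=select)
        left, right = two_means(target)
        if not left or not right:
            break  # degenerate split, e.g. duplicate points
        clusters.remove(target)
        clusters.extend([left, right])
    return clusters


points = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(bisecting_kmeans(points, k=2))
```

Passing a different `select` function, for example one that scores clusters by how spread out they are, changes which cluster is bisected at each step and therefore the final clustering.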
9
Experiment Results
Does Bisecting K-means outperform K-means?
10
Experiment Results
Does STC (Suffix Tree Clustering) outperform hierarchical or partitional approaches?
─ STC vs. hierarchical
─ STC vs. partitional
11
Experiment Results
Do partitional clustering algorithms outperform hierarchical clustering algorithms?
12
Experiment Results
Which clustering algorithm is the most scalable, and which is the least scalable?
13
Experiment Results
How much does the domain ontology MeSH improve document clustering?
14
Experiment Results
15
Experiment Results
How are the clustering evaluation metrics related to one another?
The smaller the MI and entropy, the better the clustering quality; the larger the F-measure and purity, the better the clustering quality.
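The opposite directions of these metrics can be illustrated with two of them, purity and entropy, computed from the true class labels of the documents in each cluster. This is a sketch using common textbook definitions; the exact formulas and normalization used in the paper may differ, and the function names are illustrative.

```python
import math
from collections import Counter


def purity(clusters):
    """Fraction of documents that belong to the majority class of their cluster.

    `clusters` is a list of non-empty lists of true class labels.
    """
    total = sum(len(c) for c in clusters)
    return sum(max(Counter(c).values()) for c in clusters) / total


def entropy(clusters):
    """Size-weighted average of the class entropy (in bits) inside each cluster."""
    total = sum(len(c) for c in clusters)
    result = 0.0
    for c in clusters:
        counts = Counter(c).values()
        h = -sum((n / len(c)) * math.log2(n / len(c)) for n in counts)
        result += (len(c) / total) * h
    return result


perfect = [["a", "a"], ["b", "b"]]  # each cluster holds one class
mixed = [["a", "b"], ["a", "b"]]    # each cluster is an even mix
print(purity(perfect), entropy(perfect))
print(purity(mixed), entropy(mixed))
```

The perfect clustering has the higher purity and the lower entropy, which matches the directions stated on the slide.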
16
Conclusion
─ The cluster selection method of Bisecting K-means significantly affects clustering quality.
─ Bisecting K-means generally outperforms K-means when its Type A cluster selection method is used.
─ STC provides better clustering solutions than hierarchical algorithms but worse ones than partitional clustering approaches.
17
Conclusion
─ Partitional clustering approaches are significantly superior to hierarchical approaches in terms of both clustering evaluation metrics and running time.
─ Bisecting K-means is normally superior to the other clustering methods.
─ The domain ontology MeSH improves document clustering for MEDLINE articles.