Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 A Comprehensive Comparison Study of Document Clustering.


1 Intelligent Database Systems Lab, 國立雲林科技大學 National Yunlin University of Science and Technology. A Comprehensive Comparison Study of Document Clustering for a Biomedical Digital Library MEDLINE. Advisor: Dr. Hsu. Graduate: Jing Wei Lin. Authors: Illhoi Yoo and Xiaohua Hu. 2006.JCDL.10

2 Outline: Motivation; Objective; Use MeSH Ontology on Vector Space Model; Classification of Document Clustering Approaches; Experiment Results; Conclusion

3 Motivation • Document clustering has been used for better document retrieval, document browsing, and text mining in digital libraries.

4 Objective • The goal of this paper is a comprehensive comparison of document clustering approaches in terms of efficiency, effectiveness, and scalability.

5 Use MeSH Ontology on Vector Space Model • Problem: ─ Each document d is represented as a high-dimensional vector of word/term frequencies. • Solution: ─ Map terms to MeSH descriptors through their entry terms. For example, “Neoplasms” as a descriptor has the following entry terms: {“Cancer”, “Cancers”, “Neoplasm”, “Tumors”}.
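The mapping from entry terms to a descriptor can be illustrated with a minimal sketch. Only the "Neoplasms" entry comes from the slide; the names `ENTRY_TERMS`, `build_lookup`, and `descriptor_vector` are hypothetical, and a real system would load the full MeSH vocabulary and use a proper tokenizer rather than this hand-written table and whitespace split.

```python
from collections import Counter

# Hypothetical, hand-written excerpt of a MeSH-style table. Only the
# "Neoplasms" entry is taken from the slide; everything else is assumed.
ENTRY_TERMS = {
    "Neoplasms": ["Cancer", "Cancers", "Neoplasm", "Tumors"],
}

def build_lookup(entry_terms):
    """Map each lowercase entry term (and the descriptor itself) to its descriptor."""
    lookup = {}
    for descriptor, terms in entry_terms.items():
        lookup[descriptor.lower()] = descriptor
        for term in terms:
            lookup[term.lower()] = descriptor
    return lookup

def descriptor_vector(tokens, lookup):
    """Count descriptor occurrences; tokens with no descriptor are dropped."""
    return Counter(lookup[t.lower()] for t in tokens if t.lower() in lookup)

lookup = build_lookup(ENTRY_TERMS)
doc = "Cancer cells and tumors differ but every neoplasm is studied".split()
vector = descriptor_vector(doc, lookup)
print(vector)
```

In this sketch the three surface forms "Cancer", "tumors", and "neoplasm" collapse into a single "Neoplasms" dimension, which is how synonym mapping can shrink the vector space.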

6 Classification of Document Clustering Approaches • Hierarchical clustering algorithms ─ Advantage: they generate a document hierarchy, so users can drill up and down to specific topics of interest. ─ Disadvantage: their cubic time complexity makes them impractical for very large document collections. • Partitional clustering algorithms ─ Advantage: their time complexity is lower; for example, K-means is O(k*T*n). ─ Disadvantage: clustering results are highly sensitive to the initial centroids, which are selected at random.
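A minimal K-means sketch may help make both points concrete. It assumes the usual reading of O(k*T*n) as k clusters, T iterations, and n documents, and it uses squared Euclidean distance on plain lists; the function names and the `seed` parameter are illustrative and not taken from the paper.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(points) for col in zip(*points)]

def kmeans(points, k, iterations=10, seed=0):
    rng = random.Random(seed)
    # Random initial centroids: different seeds can give different results.
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):                      # T iterations
        for i, p in enumerate(points):               # n documents
            labels[i] = min(                         # k distance computations
                range(k), key=lambda c: squared_distance(p, centroids[c])
            )
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:                              # keep old centroid if empty
                centroids[c] = mean(members)
    return labels, centroids

points = [[0, 0], [0, 1], [10, 10], [10, 11]]
labels, centroids = kmeans(points, k=2)
print(labels)
```

The three nested loops mirror the k*T*n cost, and changing `seed` changes the starting centroids, which is the source of the sensitivity noted on the slide.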

7 [Figure: diagram with repeated node labels w, CR, CL; not recoverable from the transcript]

8 Experiment Results • How much does the method Bisecting K-means uses to select the cluster to bisect affect clustering results? (Figure values: 0.44, 0.27, 0.06, 0.21)
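To show where the selection method enters, here is a minimal Bisecting K-means sketch with a pluggable `select` function. The transcript does not define the paper's selection types, so the two criteria below (largest cluster, highest within-cluster squared error) are common choices used here purely as assumptions, and all names are hypothetical.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    return [sum(col) / len(points) for col in zip(*points)]

def two_means(points, rng, iterations=10):
    """Split one cluster into two halves with a basic 2-means pass."""
    centers = rng.sample(points, 2)
    halves = ([], [])
    for _ in range(iterations):
        halves = ([], [])
        for p in points:
            nearest = 0 if sq_dist(p, centers[0]) <= sq_dist(p, centers[1]) else 1
            halves[nearest].append(p)
        if not halves[0] or not halves[1]:
            break  # degenerate split; stop refining
        centers = [centroid(halves[0]), centroid(halves[1])]
    return halves

def pick_largest(clusters):
    """Assumed criterion 1: bisect the cluster with the most members."""
    return max(range(len(clusters)), key=lambda i: len(clusters[i]))

def pick_highest_sse(clusters):
    """Assumed criterion 2: bisect the cluster with the largest squared error."""
    def sse(cluster):
        center = centroid(cluster)
        return sum(sq_dist(p, center) for p in cluster)
    return max(range(len(clusters)), key=lambda i: sse(clusters[i]))

def bisecting_kmeans(points, k, select=pick_largest, seed=0):
    rng = random.Random(seed)
    clusters = [list(points)]
    while len(clusters) < k:
        target = clusters.pop(select(clusters))  # the selection step
        left, right = two_means(target, rng)
        clusters.extend([left, right])
    return clusters

points = [[0, 0], [0, 1], [10, 10], [10, 11], [20, 0], [20, 1]]
clusters = bisecting_kmeans(points, k=3)
print(clusters)
```

Swapping `select=pick_highest_sse` into the call changes only which cluster is split next, which is the design decision the slide's question is about.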

9 Experiment Results • Does Bisecting K-means outperform K-means?

10 Experiment Results • Does STC outperform hierarchical or partitional approaches? (Figure panels: STC vs. hierarchical; STC vs. partitional)

11 Experiment Results • Do partitional clustering algorithms outperform hierarchical clustering algorithms?

12 Experiment Results • Which clustering algorithms are the most and the least scalable?

13 Experiment Results • How much does the domain ontology MeSH improve document clustering?

14 Experiment Results

15 Experiment Results • How are the clustering evaluation metrics related to one another? The smaller the MI and entropy, the better the clustering quality; the larger the F-measure and purity, the better the clustering quality.
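The opposite directions of these metrics can be seen in a small sketch of purity and entropy. It uses the common textbook definitions (majority-class fraction for purity, size-weighted class entropy per cluster for entropy); the paper may normalize them differently, and MI and F-measure are left out because the transcript does not define them. Each cluster is represented as a list of true class labels, which is an assumption of this example.

```python
import math
from collections import Counter

def purity(clusters):
    """Fraction of documents that belong to the majority class of their cluster."""
    total = sum(len(c) for c in clusters)
    return sum(max(Counter(c).values()) for c in clusters if c) / total

def entropy(clusters):
    """Size-weighted average of the class entropy inside each cluster."""
    total = sum(len(c) for c in clusters)
    result = 0.0
    for c in clusters:
        if not c:
            continue
        probs = [count / len(c) for count in Counter(c).values()]
        cluster_entropy = -sum(p * math.log2(p) for p in probs)
        result += len(c) / total * cluster_entropy
    return result

mixed = [["a", "a", "b"], ["b", "b"]]      # one document in the wrong cluster
perfect = [["a", "a"], ["b", "b", "b"]]    # every cluster holds a single class

print(purity(mixed), entropy(mixed))
print(purity(perfect), entropy(perfect))
```

Under these definitions a perfect clustering reaches purity 1 and entropy 0, while mixing classes lowers purity and raises entropy.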

16 Conclusion • The cluster selection methods of Bisecting K-means significantly affect clustering quality. • Bisecting K-means generally outperforms K-means if cluster selection method Type A is used. • STC provides better clustering solutions than hierarchical algorithms but worse ones than partitional approaches.

17 Conclusion • Partitional clustering approaches are significantly superior to hierarchical approaches in terms of both clustering evaluation metrics and running time. • Bisecting K-means is normally superior to the other clustering methods. • The domain ontology MeSH improves document clustering for MEDLINE articles.

