Published by Raymond Bell. Modified over 9 years ago.
1
Intelligent Database Systems Lab, National Yunlin University of Science and Technology (國立雲林科技大學)
Enhancing Text Clustering by Leveraging Wikipedia Semantics
Presenter: Cheng-Hui Chen
Authors: Jian Hu, Lujun Fang, Yang Cao, Hua-Jun Zeng, Hua Li, Qiang Yang, Zheng Chen
SIGIR, 2008
2
Outline
─ Motivation
─ Objectives
─ Methodology
─ Experiments
─ Conclusions
─ Comments
3
Motivation
─ Most traditional text clustering methods ignore the semantic relationships between key terms.
─ They lack an effective word sense disambiguation method.
─ As a result, synonymy and polysemy are difficult to handle.
4
Objectives
─ Enhance the clustering result by obtaining a more accurate distance measure.
─ The generated thesaurus serves as a controlled vocabulary that bridges the variety of idiolects and terminologies present in the document corpus.
5
Methodology
─ Wikipedia Thesaurus
─ Traditional text clustering
─ The framework
6
Methodology
Wikipedia Thesaurus
─ Wikipedia Concept
─ Synonymy
─ Polysemy
─ Hypernymy (Hierarchical Relation)
─ Associative relations
  Content based measure
  Out-linked category based measure
  Combination of the two measures
7
Methodology
Traditional text clustering
─ Traditional text similarity measure
  Compute the cosine similarity between document vectors.
─ Traditional text representation enrichment strategies
  Newly generated features replace, or are appended to, the original document, and a new vector representation is constructed.
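The two traditional steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes documents are already tokenized, uses raw term counts rather than any particular weighting scheme, and the helper name `enrich` and its `replace` flag are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(doc_a, doc_b):
    """Cosine similarity between two documents given as lists of tokens."""
    vec_a, vec_b = Counter(doc_a), Counter(doc_b)
    # Dot product over the terms the two documents share.
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


def enrich(tokens, new_features, replace=False):
    """Hypothetical enrichment step: generated features either replace the
    original tokens or are appended to them before vectorization."""
    if replace:
        return list(new_features)
    return list(tokens) + list(new_features)


doc_a = ["text", "clustering", "wikipedia"]
doc_b = ["text", "mining", "wikipedia"]
print(cosine_similarity(doc_a, doc_b))
print(cosine_similarity(enrich(doc_a, ["data_mining"]), enrich(doc_b, ["data_mining"])))
```

Appending a shared generated feature raises the similarity of the two toy documents, which is the effect the enrichment strategies aim for.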
8
Methodology
The framework
─ Mapping text documents into Wikipedia concept sets
  Because synonymy, polysemy and hypernymy occur frequently in text documents, terms must be mapped accurately to Wikipedia concepts.
─ Enriching the similarity measure with hierarchical relations
  Similarity measure using category vectors
  Use the category decay factor μ
  Taking the original document content into account, define the overall similarity measure.
9
Methodology
Enriching the similarity measure with synonym and associative relations
─ Build the expanded weighted concept set.
─ Add the set C_ext to C_b to obtain the extended C_b.
─ Define the similarity.
Example: with C_a = {(CS, 1), (ML, 1)} and C_b = {(DM, 1), (DB, 1)}, the result is 0.57.
10
Methodology
The combination
─ Set α and β to equal weights: α = β = 1/3
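The slide does not preserve the combination formula itself. One plausible reading, shown here purely as an assumption, is a convex combination of the three similarity scores (content, category and concept), in which α = β = 1/3 gives each component the same weight. The function and parameter names are hypothetical.

```python
def combined_similarity(sim_content, sim_category, sim_concept,
                        alpha=1 / 3, beta=1 / 3):
    """Assumed convex combination of three similarity scores.

    alpha weights the category-based score, beta the concept-based score,
    and the remaining weight (1 - alpha - beta) goes to the content score.
    This form is an assumption, not taken from the slide.
    """
    if alpha < 0 or beta < 0 or alpha + beta > 1:
        raise ValueError("alpha and beta must be non-negative and sum to at most 1")
    return (1 - alpha - beta) * sim_content + alpha * sim_category + beta * sim_concept


# With equal weights the result is the plain average of the three scores.
print(combined_similarity(0.3, 0.6, 0.9))
```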
11
Experiments
Evaluation Criteria
─ M = {M_1, M_2, ..., M_n} represents the n manually labeled clusters, and C = {C_1, C_2, ..., C_n} represents the n clusters generated by our algorithm.
─ Define the precision of C_i with respect to M_j.
─ Define the purity of the clustering result.
─ Define the corresponding inverse purity.
Example (values as shown on the slide): C with N_1 = 180, M with N_2 = 100, N_1 ∩ N_2 = 30; Inverse purity = 30%, Purity = 33.33%
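The formulas for these criteria are not preserved in the transcript. The sketch below uses the commonly cited definitions, which are assumed here rather than taken from the slide: precision(C_i, M_j) = |C_i ∩ M_j| / |C_i|, purity is the size-weighted average over generated clusters of their best precision against any manual cluster, and inverse purity is the same quantity with the roles of C and M swapped. Clusters are represented as Python sets of document ids, and the toy data is made up for illustration.

```python
def precision(cluster, reference):
    """Fraction of the members of `cluster` that also belong to `reference`."""
    if not cluster:
        return 0.0
    return len(cluster & reference) / len(cluster)


def purity(clusters, references):
    """Size-weighted average of each cluster's best precision.

    Assumes the clusters partition the documents, so the total number of
    documents is the sum of the cluster sizes.
    """
    total = sum(len(c) for c in clusters)
    return sum(
        len(c) / total * max(precision(c, m) for m in references)
        for c in clusters
    )


def inverse_purity(clusters, references):
    """Purity computed from the point of view of the manual clusters."""
    return purity(references, clusters)


# Toy data: six documents, two manual clusters, two generated clusters.
manual = [{1, 2, 3}, {4, 5, 6}]
generated = [{1, 2}, {3, 4, 5, 6}]
print(purity(generated, manual))
print(inverse_purity(generated, manual))
```

Reporting both values guards against degenerate solutions: putting every document in its own cluster maximizes purity, while a single all-inclusive cluster maximizes inverse purity.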
12
Experiments
─ BASE1: Traditional text document similarity measure.
─ BASE2: Improved with Gabrilovich’s feature generation technique on Wikipedia.
─ BASE3: K-Means clustering with Hotho’s text document representation enrichment with WordNet.
14
Conclusions
─ Our method improves clustering performance compared with previous methods.
Future work
─ Use the multilingual relations to explore applications in cross-language information retrieval and cross-language text categorization.
15
Comments
Advantages
─ Improved text clustering performance.
Applications
─ Clustering
─ Information Retrieval