Topic Oriented Semi-supervised Document Clustering Jiangtao Qiu, Changjie Tang Computer School, Sichuan University 2018/12/5 SIGMOD-IDAR 2007
OUTLINE
1. Introduction
2. Motivation
3. Topic Semantic Annotation
4. Optimizing Hierarchical Clustering
5. Experiments
6. Conclusion
1. INTRODUCTION
We are developing a text mining prototype system. It aims to mine associated events, generate hypotheses, and so on.
At present, we have completed content extraction from web pages, document classification, and document clustering.
1. INTRODUCTION
Prototype System
[Figure: architecture of the text mining prototype system. Labels: collecting data (text, web pages); preprocessing (remove noise, get feature vector); deriving needed texts (classification, clustering); needed vectors; mining (associated events, etc.); presenting.]
2. MOTIVATION
Traditional document clustering is usually treated as unsupervised learning.
General method:
(1) Extract a feature vector from each document.
(2) Compute the similarity among the vectors.
(3) Build the dissimilarity matrix.
(4) Run clustering on the matrix.
2. MOTIVATION
New challenge: can we group documents according to the user’s need?
3. Topic Semantic Annotation
We propose a new semi-supervised document clustering approach, topic oriented document clustering. It groups documents according to the user’s need.
3. Topic Semantic Annotation
Several issues need to be addressed:
(1) How to represent the user’s need?
(2) How to represent the relationship between the need and the documents?
(3) How to evaluate the similarity of documents with respect to the need?
3. Topic Semantic Annotation
3.1 How to represent user’s need?
We propose a multiple-attribute topic structure to represent the user’s need.
A topic is the user’s focus, represented by a word.
We use the concept set C of an ontology as the attribute set. The attributes of a topic are a collection of concepts {p1, ..., pn} ⊆ C, chosen so that they describe the topic well.
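The topic structure above can be pictured as a small record: one topic word plus attributes drawn from the ontology's concept set C. The sketch below is only an illustration; the names `Topic` and `make_topic` and the toy concept set are assumptions, not part of the original system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Topic:
    """A user's focus: one word plus the attributes that describe it."""
    word: str
    attributes: tuple


def make_topic(word, attributes, concept_set):
    """Build a topic, checking that every attribute is a concept in C."""
    unknown = [p for p in attributes if p not in concept_set]
    if unknown:
        raise ValueError(f"attributes not in concept set: {unknown}")
    return Topic(word, tuple(attributes))


# Toy concept set C (hypothetical); a real ontology would be far larger.
concepts = {"background", "place", "named entity", "time"}
topic = make_topic("Yao Ming", ["background", "place", "named entity"], concepts)
```

Checking the attributes against C at construction time keeps the constraint {p1, ..., pn} ⊆ C explicit.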
3. Topic Semantic Annotation
3.1 How to represent user’s need?
Example: we collect documents about Yao Ming. Several different people named Yao Ming appear in the corpus, and we want to group the documents by person.
We set ‘Yao Ming’ as the topic.
We choose background, place, and named entity as the attributes.
3. Topic Semantic Annotation
3.1 How to represent user’s need?
Reasons for choosing the three attributes:
1. Many words have a background. For example, the word ‘cancer’ has the background ‘medicine’.
When words such as ‘coach’ and ‘stadium’ occur in a document, it can be inferred that the people involved in the document are related to ‘sport’.
We have modified the ontology by adding a background to its words.
3. Topic Semantic Annotation
3.1 How to represent user’s need?
Reasons for choosing the three attributes:
2. Place distinguishes people well. The places where people grew up and have lived can separate different people who share a name.
3. Topic Semantic Annotation
3.1 How to represent user’s need?
Reasons for choosing the three attributes:
3. Named entities can describe the semantics of the topic. Names of people, institutions, and organizations that do not occur in the dictionary are called named entities.
3. Topic Semantic Annotation
3.2 How to represent relationship between the need and documents?
We represent the relationship between the topic and the documents by annotating each document with topic semantics.
Given: a document S with words {t1, ..., tm}, and a topic T with attributes p1, ..., pn.
For each word ti in S:
(1) Check whether ti can be mapped to an attribute pj through the ontology.
(2) Check whether ti is semantically correlated with T. We say ti and T are semantically correlated if the distance between ti and T is not larger than a threshold.
(3) If both conditions hold, insert ti into the vector Pj, so that Pj = {..., ti}.
When all words have been explored, we obtain the attribute vectors P1 = {..., ti}, ..., Pn = {..., tm}.
We call this process topic-semantic annotation.
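The annotation steps can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the slides do not define the distance between a word and the topic, so the sketch assumes token-position distance; the function `annotate`, the toy ontology `attribute_of`, and the pre-joined token `Yao_Ming` are all hypothetical.

```python
from collections import Counter


def annotate(tokens, topic, attribute_of, attributes, threshold):
    """Return one weighted vector per attribute for a tokenized document.

    attribute_of maps a word to (attribute, value); it stands in for the
    ontology lookup. Distance to the topic is ASSUMED to be the number of
    token positions to the nearest topic occurrence.
    """
    topic_positions = [i for i, tok in enumerate(tokens) if tok == topic]
    vectors = {p: Counter() for p in attributes}
    for i, tok in enumerate(tokens):
        mapped = attribute_of.get(tok)
        if mapped is None or not topic_positions:
            continue  # word has no attribute, or topic is absent
        attribute, value = mapped
        if attribute not in vectors:
            continue  # attribute was not chosen for this topic
        if min(abs(i - t) for t in topic_positions) <= threshold:
            vectors[attribute][value] += 1
    return vectors


# Toy ontology (hypothetical entries).
attribute_of = {
    "center": ("background", "sport"),
    "rebound": ("background", "sport"),
    "Houston": ("place", "Houston"),
    "Michigan": ("place", "Michigan"),
}
tokens = ["Houston", "Rockets", "center", "Yao_Ming", "grabs",
          "a", "rebound", "in", "Michigan"]
vectors = annotate(tokens, "Yao_Ming", attribute_of,
                   ["background", "place", "named entity"], threshold=4)
```

With a threshold of 4 tokens, "Michigan" (5 positions from the topic) is left out, which shows how the threshold filters words that are not close enough to the topic.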
3. Topic Semantic Annotation
3.2 How to represent relationship between the need and documents?
Example: Houston Rockets center Yao Ming grabs a rebound in front of Detroit Pistons forward Rasheed Wallace and Rockets forward Shane Battier during the first half of their NBA game in Auburn Hills, Michigan.
Topic: Yao Ming
Attributes: p1 = background, p2 = place, p3 = named entity
Feature vectors:
P1 = {<sport, 4>}
P2 = {<Houston, 1>, <Michigan, 1>, <Detroit, 1>}
P3 = {<Rasheed Wallace, 1>, <Shane Battier, 1>, <Auburn Hills, 1>}
3. Topic Semantic Annotation
3.3 How to evaluate similarity of documents by the need?
[Figure: documents d1 and d2, each represented by its attribute vectors V1 = {…}, …, Vn = {…}.]
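The slide does not give the similarity formula. One plausible reading, shown here purely as an assumption, is to compare the two documents attribute by attribute and average the results. The sketch uses cosine similarity per attribute with equal weights; `cosine` and `document_similarity` are hypothetical names, and the authors' actual measure may differ.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {term: weight}."""
    dot = sum(w * v.get(term, 0) for term, w in u.items())
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector shares nothing with any other
    return dot / (norm_u * norm_v)


def document_similarity(d1, d2):
    """Average the per-attribute cosine similarities (equal weights ASSUMED)."""
    attributes = sorted(d1.keys() | d2.keys())
    if not attributes:
        return 0.0
    scores = [cosine(d1.get(p, {}), d2.get(p, {})) for p in attributes]
    return sum(scores) / len(scores)


d1 = {"background": {"sport": 4}, "place": {"Houston": 1, "Michigan": 1}}
d2 = {"background": {"sport": 2}, "place": {"Shanghai": 1}}
similarity = document_similarity(d1, d2)
dissimilarity = 1.0 - similarity  # entry for a dissimilarity matrix
```

Here the two documents agree on background but share no place, so the averaged similarity is 0.5.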
4. Optimizing Hierarchical Clustering
Motivation: current clustering algorithms often require the user to set parameters such as the number of clusters, a radius, or a density threshold. If users lack the experience to choose these parameters, it is difficult to produce a good clustering solution.
4. Optimizing Hierarchical Clustering
Solution:
1. Build a clustering tree with a hierarchical clustering algorithm.
2. Use a criterion function to recommend the best clustering solution on the tree to the user.
4. Optimizing Hierarchical Clustering
Solution: the two extremes of the clustering tree are the worst solutions.
One cluster: all samples are in one cluster.
Five clusters: each sample is its own cluster.
4. Optimizing Hierarchical Clustering
Solution: we propose a criterion function that combines intra-cluster distance with inter-cluster distance. With this criterion function, the best clustering solution can be provided to the user without any parameter setting.
4. Optimizing Hierarchical Clustering
[Figure: clustering tree built bottom-up over samples A, B, C, D, E, from Level 1 to Level 5; the level with the smallest DistanceSum is marked.]
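The idea of building the tree bottom-up and then picking the level with the smallest DistanceSum can be sketched as follows. The slides do not define DistanceSum, so the criterion here is a stand-in assumed for illustration: the sum of pairwise distances inside clusters plus the sum of distances between cluster centroids. Average linkage and all function names are likewise assumptions.

```python
from itertools import combinations
from math import dist


def centroid(cluster, points):
    """Mean point of the samples whose indices are in cluster."""
    dims = len(points[0])
    return tuple(sum(points[i][d] for i in cluster) / len(cluster)
                 for d in range(dims))


def distance_sum(clusters, points):
    """Stand-in criterion: intra-cluster pairwise distances plus
    distances between cluster centroids (lower is better)."""
    intra = sum(dist(points[a], points[b])
                for c in clusters for a, b in combinations(c, 2))
    centers = [centroid(c, points) for c in clusters]
    inter = sum(dist(u, v) for u, v in combinations(centers, 2))
    return intra + inter


def average_linkage(c1, c2, points):
    """Mean distance over all cross-cluster sample pairs."""
    total = sum(dist(points[a], points[b]) for a in c1 for b in c2)
    return total / (len(c1) * len(c2))


def build_levels(points):
    """Bottom-up tree: first level has one cluster per sample,
    last level has a single cluster."""
    clusters = [(i,) for i in range(len(points))]
    levels = [list(clusters)]
    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: average_linkage(clusters[p[0]],
                                                 clusters[p[1]], points))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters)
                    if k not in (i, j)] + [merged]
        levels.append(list(clusters))
    return levels


def recommend(points):
    """Return the level of the tree with the smallest criterion value."""
    return min(build_levels(points), key=lambda lv: distance_sum(lv, points))


points = [(0.0,), (1.0,), (10.0,), (11.0,), (12.0,)]  # samples A..E
best = recommend(points)
```

Under this stand-in criterion both extremes score worst, since "one cluster" pays the full intra-cluster cost and "each sample alone" pays the full inter-cluster cost, so the recommended level lies between them without the user supplying a cluster count.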
5. Experiments
To the best of our knowledge, topic oriented document clustering has not been well addressed in existing work. The experiments in this study therefore compare our approach with the unsupervised clustering approach.
5. Experiments
Dataset: web pages collected about three different people named ‘Li Ming’.
Purpose: cluster the documents by person.
5. Experiments
Experiment 1: comparison with TFIDF on time performance and on dimensionality.
[Figure: time performance compared with TFIDF.]
[Figure: dimensionality compared with TFIDF.]
5. Experiments
Experiment 2:
1. Build the dissimilarity matrix with the new approach and with the traditional approach.
2. Run document clustering on each matrix.
3. Compare the clustering solutions using the F-measure.
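The slides do not say which variant of the F-measure was used. A widely used clustering form takes, for each true class, the best F score over all clusters and weights it by class size. The sketch below implements that form as an assumption; `clustering_f_measure` is a hypothetical name.

```python
from collections import Counter


def clustering_f_measure(true_labels, cluster_labels):
    """Class-size-weighted best-match F-measure, in the range 0..1."""
    n = len(true_labels)
    class_sizes = Counter(true_labels)
    cluster_sizes = Counter(cluster_labels)
    overlap = Counter(zip(true_labels, cluster_labels))
    total = 0.0
    for cls, n_cls in class_sizes.items():
        best = 0.0
        for clu, n_clu in cluster_sizes.items():
            n_both = overlap[(cls, clu)]
            if n_both == 0:
                continue  # no shared documents, F is zero
            precision = n_both / n_clu
            recall = n_both / n_cls
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_cls / n) * best
    return total


# Four documents about two people, clustered perfectly and then badly.
people = ["a", "a", "b", "b"]
perfect = clustering_f_measure(people, [0, 0, 1, 1])
lumped = clustering_f_measure(people, [0, 0, 0, 0])
```

Multiplying the result by 100 gives a percentage comparable in form to the F (%) column reported for Experiment 2.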
5. Experiments
Experiment 2:
Approach    Number of clusters    F (%)
ODSA        5                     56
TFIDF(1)    7                     40.7
TFIDF(2)                          38.9
TFIDF(3)                          37
TFIDF(4)                          33.7
TFIDF(5)                          33
6. Conclusion
Experiments show that the new approach is feasible and effective. To further improve performance, however, some work remains, such as improving the accuracy of named entity recognition.
Thanks! Any questions?