Topic Oriented Semi-supervised Document Clustering


1 Topic Oriented Semi-supervised Document Clustering
Jiangtao Qiu, Changjie Tang Computer School, Sichuan University 2018/12/5 SIGMOD-IDAR 2007

2 OUTLINE
1. Introduction 2. Motivation 3. Topic Semantic Annotation 4. Optimizing Hierarchical Clustering 5. Experiments 6. Conclusion

3 1. INTRODUCTION We are developing a text mining prototype system.
It aims to mine associated events, generate hypotheses, etc. At present, we have completed content extraction from web pages, document classification, and document clustering.

4 1. INTRODUCTION Prototype System
[Figure: prototype system pipeline. Labels recoverable from the diagram: collecting data (web pages, text); preprocessing (remove noise, get feature vector); classification and clustering (deriving the needed texts and vectors); mining (associated events, etc.); presenting.]

5 OUTLINE
1. Introduction 2. Motivation (current section) 3. Topic Semantic Annotation 4. Optimizing Hierarchical Clustering 5. Experiments 6. Conclusion

6 2. MOTIVATION Traditional document clustering is usually treated as unsupervised learning. General method: (1) extract a feature vector from each document; (2) compute the similarity among the vectors; (3) build a dissimilarity matrix; (4) run the clustering algorithm on the matrix.
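
The general method on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides do not say which weighting or similarity the traditional baseline uses, so TF-IDF weights and cosine similarity are assumed here, and the function names (tfidf_vectors, cosine, dissimilarity_matrix) are hypothetical. The clustering step itself is left out.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append(
            {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def dissimilarity_matrix(vectors):
    """Assumed dissimilarity: 1 - cosine similarity for every document pair."""
    return [[1.0 - cosine(u, v) for v in vectors] for u in vectors]


docs = [
    ["coach", "stadium", "game"],
    ["coach", "game", "rebound"],
    ["cancer", "medicine", "doctor"],
]
matrix = dissimilarity_matrix(tfidf_vectors(docs))
```

In this toy corpus the two sport documents share terms and end up closer to each other than either is to the medical document. Note that every distinct term becomes a dimension, which is the dimensionality cost that a topic-oriented representation tries to avoid.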

7 2. Motivation New challenge: can we group documents according to the user’s need?

8 OUTLINE
1. Introduction 2. Motivation 3. Topic Semantic Annotation (current section) 4. Optimizing Hierarchical Clustering 5. Experiments 6. Conclusion

9 3. Topic Semantic Annotation
We propose a new semi-supervised document clustering approach: topic-oriented document clustering. It can group documents according to the user’s need.

10 3. Topic Semantic Annotation
Several issues need to be addressed:
(1) How to represent the user’s need?
(2) How to represent the relationship between the need and the documents?
(3) How to evaluate the similarity of documents with respect to the need?

12 3. Topic Semantic Annotation
3.1 How to represent user’s need? (1) We propose a multiple-attribute topic structure to represent the user’s need. A topic is the user’s focus, represented by a word. We use the concept set C of an ontology as the attribute set. The attributes of a topic are a collection of concepts {p1, …, pn} ⊆ C, chosen so that they describe the topic well.

13 3. Topic Semantic Annotation
3.1 How to represent user’s need? Example: we collect documents about Yao Ming. Several different people named Yao Ming appear in the corpus, and we want to group the documents by person. We set ‘Yao Ming’ as the topic and choose background, place, and named entity as its attributes.

14 3. Topic Semantic Annotation
3.1 How to represent user’s need? Reasons for choosing the three attributes. 1. Many words have a background; for example, the word ‘cancer’ has the background ‘medicine’. Likewise, when words such as ‘coach’ and ‘stadium’ appear in a document, it can be inferred that the people involved in the document are related to ‘sport’.

15 3. Topic Semantic Annotation
3.1 How to represent user’s need? Reasons for choosing the three attributes. 1. Many words have a background (for example, ‘cancer’ has the background ‘medicine’). We have modified the ontology by adding a background to the words in it.

16 3. Topic Semantic Annotation
3.1 How to represent user’s need? Reasons for choosing the three attributes. 2. Place distinguishes different people well: the places where people grew up and have lived can tell them apart.

17 3. Topic Semantic Annotation
3.1 How to represent user’s need? Reasons for choosing the three attributes. 3. Named entities can be used to describe the semantics of the topic. Names of people, institutions, and organizations that do not occur in the dictionary are called named entities.

19 3. Topic Semantic Annotation
3.2 How to represent relationship between the need and documents? We represent the relationship between the topic and the documents by annotating the documents with topic semantics.

20 3. Topic Semantic Annotation
3.2 How to represent relationship between the need and documents? Document S contains the words {t1, …, tn}; topic T has the attributes p1, …, pn. First condition: a word ti can be mapped to an attribute pj through the ontology (ti → pj).

21 3. Topic Semantic Annotation
3.2 How to represent relationship between the need and documents? Document S contains the words {t1, …, tn}; topic T has the attributes p1, …, pn. Second condition: ti is semantically correlated with T. We call ti and T semantically correlated if the distance between ti and T is not larger than a threshold.

22 3. Topic Semantic Annotation
3.2 How to represent relationship between the need and documents? Document S contains the words {t1, …, tn}; topic T has the attributes p1, …, pn. If ti can be mapped to an attribute pj and is semantically correlated with T, insert ti into the vector Pj: Pj = {…, ti}.

23 3. Topic Semantic Annotation
3.2 How to represent relationship between the need and documents? Document S contains the words {t1, …, tn}; topic T has the attributes p1, …, pn. When all words have been examined, we obtain the attribute vectors P1 = {…, ti}, …, Pn = {…, tm}.

24 3. Topic Semantic Annotation
3.2 How to represent relationship between the need and documents? Document S contains the words {t1, …, tn}; topic T has the attributes p1, …, pn. We call the process of building the attribute vectors P1, …, Pn from the words of S topic-semantic annotation.
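
The annotation process can be sketched as follows. This is a minimal illustration, not the authors' implementation: the slides do not define the ontology interface or the distance between a word and the topic, so the sketch assumes a hypothetical ontology given as a dict from word to (attribute, concept), and assumes the distance is the number of tokens to the nearest mention of the topic. Named entity recognition is outside the sketch.

```python
from collections import Counter


def token_distance(words, position, topic):
    """Assumed distance: tokens between a word and the nearest topic mention."""
    hits = [i for i, w in enumerate(words) if w == topic]
    return min(abs(position - i) for i in hits) if hits else float("inf")


def annotate(words, topic, attributes, ontology, threshold):
    """Return one vector (concept -> count) per attribute of the topic."""
    vectors = {p: Counter() for p in attributes}
    for position, word in enumerate(words):
        # Hypothetical ontology lookup: word -> (attribute, concept).
        entry = ontology.get(word)
        if entry is None:
            continue
        attribute, concept = entry
        if attribute not in vectors:
            continue
        # Keep the word only if it is close enough to the topic.
        if token_distance(words, position, topic) <= threshold:
            vectors[attribute][concept] += 1
    return vectors


# Toy input; the ontology entries below are made up for illustration.
words = ["Rockets", "center", "Yao Ming", "grabs", "rebound", "in", "Michigan"]
toy_ontology = {
    "center": ("background", "sport"),
    "rebound": ("background", "sport"),
    "Michigan": ("place", "Michigan"),
}
vectors = annotate(
    words, "Yao Ming", ["background", "place", "named entity"], toy_ontology, threshold=5
)
```

Each attribute vector stores concepts with their counts, so its size depends on the number of matched concepts rather than on the vocabulary of the whole corpus.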

31 3. Topic Semantic Annotation
3.2 How to represent relationship between the need and documents? Example: Houston Rockets center Yao Ming grabs a rebound in front of Detroit Pistons forward Rasheed Wallace and Rockets forward Shane Battier during the first half of their NBA game in Auburn Hills, Michigan.
Topic: Yao Ming
Attributes: p1 = background, p2 = place, p3 = named entity
Feature vectors:
P1 = {<sport, 4>}
P2 = {<Houston, 1>, <Michigan, 1>, <Detroit, 1>}
P3 = {<Rasheed Wallace, 1>, <Shane Battier, 1>, <Auburn Hills, 1>}

33 3. Topic Semantic Annotation
3.3 How to evaluate similarity of documents by the need? [Figure: documents d1 and d2, each represented by its attribute vectors V1 = {…}, …, Vn = {…}; the similarity of d1 and d2 is evaluated from these vectors.]
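
The transcript does not preserve the similarity formula, so the following is only one plausible way to compare two annotated documents, not the authors' measure. It assumes each document is a dict from attribute name to a {concept: count} vector, compares the vectors attribute by attribute with cosine similarity, and averages the results; the optional weights argument is a hypothetical extension.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse count vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def topic_similarity(doc_a, doc_b, weights=None):
    """Weighted mean of per-attribute cosine similarities (assumed measure)."""
    attributes = sorted(set(doc_a) | set(doc_b))
    if weights is None:
        weights = {p: 1.0 for p in attributes}
    total = sum(weights[p] for p in attributes)
    if not total:
        return 0.0
    score = sum(
        weights[p] * cosine(doc_a.get(p, {}), doc_b.get(p, {})) for p in attributes
    )
    return score / total


d1 = {"background": {"sport": 4}, "place": {"Houston": 1, "Michigan": 1, "Detroit": 1}}
d2 = {"background": {"sport": 2}, "place": {"Houston": 1}}
similarity = topic_similarity(d1, d2)
dissimilarity = 1.0 - similarity  # entry for the dissimilarity matrix
```

Whatever measure is used, the result feeds the same dissimilarity matrix that the clustering step consumes.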

34 OUTLINE
1. Introduction 2. Motivation 3. Topic Semantic Annotation 4. Optimizing Hierarchical Clustering (current section) 5. Experiments 6. Conclusion

35 4. Optimizing Hierarchical Clustering
Motivation: current clustering algorithms often require the user to set parameters such as the number of clusters, a radius, or a density threshold. If users lack the experience to choose these parameters, it is difficult to produce a good clustering solution.

36 4. Optimizing Hierarchical Clustering
Solution: 1. Build a clustering tree with a hierarchical clustering algorithm. 2. Use a criterion function to recommend the best clustering solution on the clustering tree to the user.

37 4. Optimizing Hierarchical Clustering
Solution: [Figure: the two worst solutions on a clustering tree of five samples. One cluster: all samples in one cluster. Five clusters: each sample is its own cluster.]

38 4. Optimizing Hierarchical Clustering
Solution: we propose a criterion function that combines the intra-cluster (within-cluster) distance with the inter-cluster (between-cluster) distance. With this criterion function, the best clustering solution can be recommended to the user without any parameter setting.

39 4. Optimizing Hierarchical Clustering
[Figure: clustering tree over the samples A, B, C, D, E, built bottom-up from Level 1 to Level 5.] With the criterion function, the best clustering solution can be recommended to the user without any parameter setting.

40 4. Optimizing Hierarchical Clustering
[Figure: the same clustering tree over the samples A, B, C, D, E (Level 1 to Level 5); the level with the smallest DistanceSum is marked.] With the criterion function, the best clustering solution can be recommended to the user without any parameter setting.
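
The idea of building the tree bottom-up and then recommending the level with the smallest criterion value can be sketched as below. The transcript does not give the definition of DistanceSum, so distance_sum here is a stand-in chosen for illustration: for dissimilarities in [0, 1] it adds d(i, j) for pairs placed in the same cluster and 1 - d(i, j) for pairs placed in different clusters. Average linkage is likewise an assumption, and all function names are hypothetical.

```python
from itertools import combinations


def average_link(a, b, dist):
    """Mean pairwise dissimilarity between clusters a and b (assumed linkage)."""
    return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))


def build_levels(dist):
    """Bottom-up clustering tree: one partition per level, singletons first."""
    clusters = [(i,) for i in range(len(dist))]
    levels = [list(clusters)]
    while len(clusters) > 1:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: average_link(pair[0], pair[1], dist),
        )
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        levels.append(list(clusters))
    return levels


def distance_sum(partition, dist):
    """Stand-in criterion: penalize distant pairs kept together and close pairs split."""
    label = {i: k for k, cluster in enumerate(partition) for i in cluster}
    total = 0.0
    for i, j in combinations(range(len(dist)), 2):
        total += dist[i][j] if label[i] == label[j] else 1.0 - dist[i][j]
    return total


def recommend(dist):
    """Return the level of the tree with the smallest criterion value."""
    return min(build_levels(dist), key=lambda p: distance_sum(p, dist))


# Five samples (A..E as indices 0..4): {A, B} and {C, D, E} are tight groups.
group = [0, 0, 1, 1, 1]
dist = [
    [0.0 if i == j else (0.1 if group[i] == group[j] else 0.9) for j in range(5)]
    for i in range(5)
]
best = recommend(dist)
```

With this stand-in, both extremes of the tree (all singletons, one big cluster) score badly, and no cluster count or threshold has to be supplied by the user.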

41 OUTLINE
1. Introduction 2. Motivation 3. Topic Semantic Annotation 4. Optimizing Hierarchical Clustering 5. Experiments (current section) 6. Conclusion

42 5. Experiments To the best of our knowledge, topic-oriented document clustering has not been well addressed in existing work. The experiments in this study therefore compare our approach with an unsupervised clustering approach.

43 5. Experiments Dataset: web pages collected about three different people named ‘Li Ming’. Purpose: cluster the documents by person.

44 5. Experiments Experiment 1: comparison of time performance with TFIDF. [Chart not recoverable from the transcript.]

45 5. Experiments Experiment 1: comparison of dimensionality with TFIDF. [Chart not recoverable from the transcript.]

46 5. Experiments Experiment 2:
1. Build a dissimilarity matrix with the new approach and with the traditional approach. 2. Run document clustering on each matrix. 3. Compare the clustering solutions using the F-measure.
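
For the comparison step, the slides do not say which variant of the F-measure was used. A common choice for clustering, assumed here, matches each true class to the cluster that gives it the best F score and then averages over classes, weighted by class size. The function name is hypothetical.

```python
from collections import Counter


def clustering_f_measure(true_labels, cluster_labels):
    """Class-size-weighted best-match F-measure of a clustering (assumed variant)."""
    n = len(true_labels)
    class_sizes = Counter(true_labels)
    cluster_sizes = Counter(cluster_labels)
    overlap = Counter(zip(true_labels, cluster_labels))
    total = 0.0
    for cls, n_cls in class_sizes.items():
        best = 0.0
        for clu, n_clu in cluster_sizes.items():
            n_both = overlap[(cls, clu)]
            if n_both == 0:
                continue
            precision = n_both / n_clu
            recall = n_both / n_cls
            best = max(best, 2 * precision * recall / (precision + recall))
        total += n_cls / n * best
    return total


# Toy example: two people, four documents.
truth = ["person1", "person1", "person2", "person2"]
perfect = clustering_f_measure(truth, [0, 0, 1, 1])
merged = clustering_f_measure(truth, [0, 0, 0, 0])
```

A clustering that recovers the people exactly scores 1, while putting every document in one cluster is penalized through precision.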

47 5. Experiments Experiment 2:
Approach    Number of clusters    F (%)
ODSA        5                     56
TFIDF(1)    7                     40.7
TFIDF(2)                          38.9
TFIDF(3)                          37
TFIDF(4)                          33.7
TFIDF(5)                          33

48 OUTLINE
1. Introduction 2. Motivation 3. Topic Semantic Annotation 4. Optimizing Hierarchical Clustering 5. Experiments 6. Conclusion (current section)

49 6. Conclusion Experiments show that the new approach is feasible and effective. However, further work is needed to improve performance, such as improving the accuracy of named entity recognition.

50 Thanks! Any questions?

