Topic Oriented Semi-supervised Document Clustering


Topic Oriented Semi-supervised Document Clustering. Jiangtao Qiu, Changjie Tang. Computer School, Sichuan University. SIGMOD-IDAR 2007.

OUTLINE: 1. Introduction 2. Motivation 3. Topic Semantic Annotation 4. Optimizing Hierarchical Clustering 5. Experiments 6. Conclusion

1. INTRODUCTION We are developing a text mining prototype system that aims to mine associative events, generate hypotheses, etc. At present, we have completed content extraction from web pages, document classification, and document clustering.

1. INTRODUCTION Prototype system pipeline (figure): collecting data (web pages, text) → preprocessing (remove noise, extract feature vectors) → classification and clustering (deriving the needed texts and vectors) → mining associative events, etc.


2. MOTIVATION Traditional document clustering is usually treated as unsupervised learning. General method: documents → extract feature vectors → compute similarity among vectors → build a dissimilarity matrix → run the clustering algorithm.
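The general unsupervised pipeline on this slide can be sketched as follows; the documents, the term-frequency features, and the single merge step are illustrative placeholders, not the paper's setup:

```python
# Minimal sketch of the general pipeline: term-frequency feature vectors,
# pairwise cosine similarity, a dissimilarity matrix, and one agglomerative
# merge step. Documents are illustrative placeholders.
import math
from collections import Counter

def feature_vector(doc):
    """Extract a simple term-frequency feature vector."""
    return Counter(doc.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0

docs = ["yao ming plays basketball",
        "yao ming grabs a rebound",
        "stock market rises"]
vectors = [feature_vector(d) for d in docs]

# Dissimilarity matrix: 1 - similarity.
dissim = [[1.0 - cosine(u, v) for v in vectors] for u in vectors]

# Single-link agglomerative step: find the closest pair of documents.
pairs = [(dissim[i][j], i, j)
         for i in range(len(docs)) for j in range(i + 1, len(docs))]
closest = min(pairs)
```

On this toy corpus the two basketball documents are merged first, since they share terms that the finance document lacks.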

2. Motivation New challenge: can we group documents according to the user's need?


3. Topic Semantic Annotation We propose a new semi-supervised document clustering approach that can group documents according to the user's need: topic-oriented document clustering.

3. Topic Semantic Annotation Several issues need to be addressed: (1) How to represent the user's need? (2) How to represent the relationship between the need and the documents? (3) How to evaluate the similarity of documents with respect to the need?

3. Topic Semantic Annotation 3.1 How to represent the user's need? We propose a multiple-attribute topic structure to represent the user's need. A topic is a user's focus, represented by a word. We use the concept set C in an ontology as the attribute set: the attributes of a topic are a collection of concepts {p1, ..., pn} ⊆ C, and these attributes can describe the topic well.
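The paper defines the topic structure only abstractly; a minimal sketch of one way to encode it, with hypothetical field names, is:

```python
# A multiple-attribute topic: a focus word plus attribute concepts drawn
# from the ontology's concept set C. Field names are illustrative, not
# taken from the paper.
from dataclasses import dataclass, field

@dataclass
class Topic:
    word: str                                       # the user's focus
    attributes: list = field(default_factory=list)  # concepts {p1, ..., pn} from C

topic = Topic(word="Yao Ming",
              attributes=["background", "place", "named entity"])
```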

3. Topic Semantic Annotation 3.1 How to represent the user's need? For example: collecting documents about Yao Ming. There are several people named Yao Ming in the corpus, and we want to group the documents by which Yao Ming they refer to. We set 'Yao Ming' as the topic and choose background, place, and named entity as the attributes.

3. Topic Semantic Annotation 3.1 How to represent the user's need? Reasons for choosing these three attributes: 1. Many words have a background (e.g., 'cancer' has the background 'medicine'). For instance, when the words 'coach' and 'stadium' appear in a document, it can be inferred that the people involved in the document are related to sport. We have modified the ontology by adding a background for each word.

3. Topic Semantic Annotation 3.1 How to represent the user's need? Reasons for choosing these three attributes: 2. Place can distinguish different people well: the places where people have grown up and lived can tell different people apart.

3. Topic Semantic Annotation 3.1 How to represent the user's need? Reasons for choosing these three attributes: 3. Named entities may be used to describe the semantics of a topic. People names and institution or organization names that do not occur in the dictionary are called named entities.


3. Topic Semantic Annotation 3.2 How to represent the relationship between the need and the documents? We represent the relationship between the topic and the documents by annotating the topic semantics of each document.

3. Topic Semantic Annotation 3.2 How to represent the relationship between the need and the documents? Given a document S with words {t1, ..., tn} and a topic T with attributes p1, ..., pn: if a word ti can be mapped to an attribute pj through the ontology, and ti is semantically correlated with T (we call ti and T semantically correlated if the distance between ti and T is not larger than a threshold), then ti is inserted into the attribute vector Pj = {..., ti}. When all words have been explored, we obtain the attribute vectors P1 = {..., ti}, ..., Pn = {..., tm}. We call this process topic-semantic annotation.
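The annotation loop described above can be sketched as follows. The ontology mapping and the topic-distance function are illustrative stubs (the paper measures distance over its modified ontology, which is not reproduced here):

```python
# Sketch of topic-semantic annotation: every word that maps to an attribute
# and lies within the distance threshold of the topic is inserted into that
# attribute's vector. The mapping and distance function are placeholder stubs.
THRESHOLD = 3

ontology_map = {          # word -> the attribute it maps to (stub)
    "rebound": "background",
    "michigan": "place",
}

def topic_distance(word, topic):
    """Stub for the semantic distance between a word and the topic."""
    return 1  # pretend every mapped word is close to the topic

def annotate(words, topic, attributes):
    vectors = {p: [] for p in attributes}
    for t in words:
        p = ontology_map.get(t)
        if p in vectors and topic_distance(t, topic) <= THRESHOLD:
            vectors[p].append(t)
    return vectors

vectors = annotate(["yao", "rebound", "michigan"], "Yao Ming",
                   ["background", "place", "named entity"])
```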

3. Topic Semantic Annotation 3.2 How to represent the relationship between the need and the documents? Example: "Houston Rockets center Yao Ming grabs a rebound in front of Detroit Pistons forward Rasheed Wallace and Rockets forward Shane Battier during the first half of their NBA game in Auburn Hills, Michigan." Topic: Yao Ming. Attributes: p1 = background, p2 = place, p3 = named entity. Feature vectors: P1 = {<sport, 4>}; P2 = {<Houston, 1>, <Michigan, 1>, <Detroit, 1>}; P3 = {<Rasheed Wallace, 1>, <Shane Battier, 1>, <Auburn Hills, 1>}.


3. Topic Semantic Annotation 3.3 How to evaluate the similarity of documents with respect to the need? (Figure: documents d1 and d2 are each represented by attribute vectors V1 = {...}, ..., Vn = {...}; their similarity is computed by comparing the corresponding attribute vectors.)
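The slide does not spell out the similarity formula. One plausible reading, sketched under that assumption, is to take the cosine similarity of each pair of corresponding weighted attribute vectors and average over the attributes; the documents below are illustrative:

```python
# Assumed (not the paper's confirmed) document similarity: average the
# per-attribute cosine similarity over the attributes the documents share.
import math

def vec_cosine(u, v):
    """Cosine similarity between two {term: weight} attribute vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

def doc_similarity(doc1, doc2):
    """Average per-attribute similarity over the shared attribute set."""
    attrs = doc1.keys() & doc2.keys()
    if not attrs:
        return 0.0
    return sum(vec_cosine(doc1[p], doc2[p]) for p in attrs) / len(attrs)

d1 = {"background": {"sport": 4}, "place": {"Houston": 1, "Michigan": 1}}
d2 = {"background": {"sport": 2}, "place": {"Houston": 1}}
```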


4. Optimizing Hierarchical Clustering Motivation: current clustering algorithms often require the user to set parameters such as the number of clusters, or a radius or density threshold. If users lack the experience to choose these parameters, it is difficult to produce a good clustering solution.

4. Optimizing Hierarchical Clustering Solution: 1. Build a clustering tree using a hierarchical clustering algorithm. 2. Recommend the best clustering solution on the tree to the user using a criterion function.

4. Optimizing Hierarchical Clustering Solution: (Figure: worst solutions at the two extremes, illustrated with one cluster versus five clusters.) All samples in one cluster, or each sample in its own cluster, are the worst solutions.

4. Optimizing Hierarchical Clustering Solution: Combining intra-cluster distance with inter-cluster distance, we propose a criterion function. The best clustering solution can then be recommended to the user by the criterion function, without any parameter setting.

4. Optimizing Hierarchical Clustering (Figure: a clustering tree over samples A, B, C, D, E, built bottom-up through levels 1 to 5; the level with the smallest DistanceSum is selected.) The best clustering solution can be recommended to the user by the criterion function, without any parameter setting.
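The selection step on this slide can be sketched as follows. The paper's DistanceSum combines intra-cluster and inter-cluster distance, but its exact formula is not given here, so the per-level values below are hypothetical:

```python
# Sketch: evaluate the criterion function (DistanceSum) at every level of
# the clustering tree and recommend the level where it is smallest. The
# values below are illustrative placeholders, not the paper's formula.
def best_level(criterion_by_level):
    """Return the level whose criterion value is smallest."""
    return min(criterion_by_level, key=criterion_by_level.get)

# Hypothetical DistanceSum values for the 5-level tree over samples A..E:
criterion = {1: 9.2, 2: 6.1, 3: 4.3, 4: 5.0, 5: 8.7}
recommended = best_level(criterion)
```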


5. Experiments To the best of our knowledge, topic-oriented document clustering has not been well addressed in existing work. The experiments in this study therefore compare our approach with an unsupervised clustering approach.

5. Experiments Dataset: web pages involving three people named 'Li Ming'. Purpose: cluster the documents by person.

5. Experiments Experiment 1: comparing time performance with the TFIDF-based approach. (Chart omitted.)

5. Experiments Experiment 1: comparing dimensionality with the TFIDF-based approach. (Chart omitted.)

5. Experiments Experiment 2: 1. Use the new approach and the traditional approach to build dissimilarity matrices. 2. Run document clustering on each matrix. 3. Compare the clustering solutions using the F-Measure.
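The F-Measure comparison in step 3 can be sketched as the standard clustering F-Measure: for each labeled class, take the best F score over all clusters, then weight by class size. The slide does not give the exact variant used, so this is an assumed formulation with illustrative labels:

```python
# Standard clustering F-Measure (assumed variant): per class, the best
# harmonic mean of precision and recall over all clusters, weighted by
# class size. Document ids below are illustrative.
def f_measure(clusters, classes):
    """clusters, classes: lists of sets of document ids."""
    n = sum(len(c) for c in classes)
    total = 0.0
    for cls in classes:
        best = 0.0
        for clu in clusters:
            overlap = len(cls & clu)
            if overlap == 0:
                continue
            precision = overlap / len(clu)
            recall = overlap / len(cls)
            best = max(best, 2 * precision * recall / (precision + recall))
        total += len(cls) / n * best
    return total

clusters = [{1, 2, 3}, {4, 5}]
classes = [{1, 2}, {3, 4, 5}]
score = f_measure(clusters, classes)
```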

5. Experiments Experiment 2 results:

Approach   Number of clusters   F (%)
ODSA       5                    56
TFIDF(1)   7                    40.7
TFIDF(2)                        38.9
TFIDF(3)                        37
TFIDF(4)                        33.7
TFIDF(5)                        33


6. Conclusion Experiments show that the new approach is feasible and effective. However, to further improve performance, some work remains to be done, such as improving the accuracy of named entity recognition.

Thanks! Any questions?