Adding Semantics to Email Clustering Hua Li, Dou Shen, Benyu Zhang, Zheng Chen, Qiang Yang Microsoft Research Asia, Beijing, P.R.China Department of Computer.


Adding Semantics to Email Clustering
Hua Li, Dou Shen, Benyu Zhang, Zheng Chen, Qiang Yang
Microsoft Research Asia, Beijing, P.R. China
Department of Computer Science and Engineering, Hong Kong University of Science and Technology
(ICDM 2006)
Date: 2008/10/16
Speaker: Cho Chin Wei
Advisor: Dr. Koh, JiaLing

Abstract
- This paper presents a novel algorithm that clusters emails according to their contents and the sentence styles of their subject lines. The goals are to:
  - improve the clustering performance
  - provide meaningful descriptions of the resulting clusters

Introduction
- Emails have become an important medium of communication.
- A user may receive tens or even hundreds of emails every day. Handling these emails takes much time.
- Automatic approaches are needed to relieve the burden of processing the emails.

Introduction
- Supervised methods need a predefined taxonomy.
  - They require users to update the taxonomy manually.
  - Preparing training data is time-consuming and expensive.
- Unsupervised techniques such as clustering are an attractive alternative.

Introduction
- Conventionally, email clustering is based on the bag-of-words representation.
- This simplistic approach cannot take full advantage of the valuable linguistic features inherent in semi-structured emails, which may result in unsatisfactory performance.

Method
- In this paper, we present a novel technique that clusters emails according to the sentence patterns discovered from their subject lines.
- Each subject line is treated as a sentence and parsed with natural language processing techniques.

Method
- Automatically generate meaningful generalized sentence patterns (GSPs) from the subjects of emails, using:
  - natural language processing techniques
  - frequent itemset mining techniques
- Then put forward a novel unsupervised approach that:
  - treats GSPs as pseudo class labels
  - conducts email clustering in a supervised manner
  - involves no human labeling

What is GSP?
- GSP (Generalized Sentence Pattern): it helps summarize the subjects of a large number of similar emails, which results in a semantic representation of the subject lines.
- Ex.: the GSP {“person”, “seminar”, “date”} means that someone (“person”) gives a “seminar” on some day (“date”).

Generalization of Terms - NLPWin
- A natural language parser, Microsoft's NLPWin tool, takes a sentence as input and builds a syntactic tree for the sentence.

Generalization of Terms - NLPWin
- The NLPWin tool can generate factoids for noun phrases, e.g. person names, dates, and places.

Mine GSP
- For each subject:
  - Remove stop words.
  - Use NLPWin to generate its syntactic tree.
  - Add the factoids of the nodes to the subject.
- The resulting subjects are called generalized sentences.
- Subsets of GSPs are still GSPs.
- Ex.: {welcome, Bruce, Lee, person} and {welcome, Bob, Brill, person} are generalized sentences.
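The generalization step above can be sketched roughly as follows. NLPWin is not modeled here: the stop-word list and the FACTOIDS lookup table are hypothetical stand-ins for what the parser would provide, and terms are lowercased for simplicity.

```python
# Hypothetical stop-word list and factoid lookup (assumptions for this sketch).
# A real system would obtain factoids from a parser such as NLPWin.
STOP_WORDS = {"a", "an", "the", "to", "on", "of", "by", "for"}
FACTOIDS = {
    "bruce lee": "person",
    "bob brill": "person",
}

def generalize(subject):
    """Turn a subject line into a generalized sentence (a set of terms)."""
    # Remove stop words.
    words = [w for w in subject.lower().split() if w not in STOP_WORDS]
    terms = set(words)
    # Add the factoid of every known phrase found in the subject.
    text = " ".join(words)
    for phrase, factoid in FACTOIDS.items():
        if phrase in text:
            terms.add(factoid)
    return frozenset(terms)
```

With this table, generalize("Welcome Bruce Lee") yields the lowercased counterpart of {welcome, Bruce, Lee, person}.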

Mine GSP
- Definition: Generalized Sentence Pattern (GSP)
  - Given a set of generalized sentences S = {s_1, s_2, ..., s_n} and a generalized sentence p, the support of p in S is defined as sup(p) = |{s_i | s_i ∈ S and p ⊆ s_i}|.
  - Given a minimum support min_sup, if sup(p) > min_sup, then p is called a generalized sentence pattern, or GSP for short.
- Ex.: {welcome, person} and {interview, date}
- Only closed GSPs are mined in our experiment.
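As a minimal sketch of the definition above, the support of a pattern and the set of closed GSPs can be computed by brute force. The function names are illustrative, and the exhaustive subset enumeration is an assumption made for clarity. The slides say only that frequent closed itemset mining is used, without naming a specific miner.

```python
from itertools import combinations

def support(pattern, sentences):
    """sup(p): number of generalized sentences containing every term of p."""
    return sum(1 for s in sentences if pattern <= s)

def mine_gsps(sentences, min_sup):
    """Brute force: keep every non-empty term subset with sup(p) > min_sup."""
    candidates = set()
    for s in sentences:
        terms = sorted(s)
        for r in range(1, len(terms) + 1):
            candidates.update(frozenset(c) for c in combinations(terms, r))
    gsps = {}
    for p in candidates:
        sup = support(p, sentences)
        if sup > min_sup:  # strict inequality, as in the definition
            gsps[p] = sup
    return gsps

def closed_only(gsps):
    """Keep GSPs that have no proper superset with the same support."""
    return {p: sup for p, sup in gsps.items()
            if not any(p < q and sup == gsps[q] for q in gsps)}
```

Enumerating all subsets is exponential in the subject length, so this is only practical for short subject lines; a dedicated closed-itemset miner would be the realistic choice.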

GSPs grouping and selection
- Group similar GSPs together to reduce the number of GSPs.
- GSPs in the same group represent the same cluster.
- The similarity between two GSPs p and q is defined based on their subset-superset relationship and support:
  Sim(p, q) = 1, if p ⊃ q and sup(p) / sup(q) > min_conf
  Sim(p, q) = 0, otherwise
- Experimental results suggest a min_conf between 0.5 and 0.8:
  - too low => more GSPs are grouped together, which introduces noise
  - too high => similar GSPs fail to be grouped
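The similarity test can be written directly from the definition above. How GSPs are then merged into groups is not spelled out on the slide; the sketch below assumes that any two GSPs linked by Sim = 1 end up in the same group (a transitive merge via union-find). The `sup` argument is a hypothetical mapping from each GSP to its support.

```python
def sim(p, q, sup, min_conf):
    """1 if p is a proper superset of q and sup(p)/sup(q) > min_conf, else 0."""
    return 1 if p > q and sup[p] / sup[q] > min_conf else 0

def group_gsps(sup, min_conf):
    """Merge GSPs connected by sim == 1 (assumed transitive grouping)."""
    parent = {p: p for p in sup}

    def find(x):
        # Find the group representative, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for p in sup:
        for q in sup:
            if sim(p, q, sup, min_conf):
                parent[find(p)] = find(q)

    groups = {}
    for p in sup:
        groups.setdefault(find(p), []).append(p)
    return list(groups.values())
```

Because p ⊃ q implies sup(p) <= sup(q), the ratio is at most 1, so min_conf acts like a confidence threshold in association-rule mining.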

GSPs grouping and selection
- The number of GSP groups can be several times larger than the actual number of clusters.
- We select representative GSP groups using heuristic rules:
  - Sort the GSP groups in descending order of length.
  - Sort them in descending order of support.
  - Select the first sp_num GSP groups for clustering.
- The parameter sp_num controls how many GSP groups are selected for clustering.
- A longer GSP is more confident than a shorter one.
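A small sketch of the selection heuristic follows. Several details are assumptions, since the slide does not define them: the length of a group is taken as the length of its longest GSP, the support of a group as the highest support among its GSPs, and support is used as a tie-breaker after length.

```python
def select_groups(groups, sup, sp_num):
    """Return the first sp_num groups, longest first, ties broken by support."""
    def key(group):
        length = max(len(p) for p in group)    # assumed group length
        best_sup = max(sup[p] for p in group)  # assumed group support
        return (length, best_sup)

    return sorted(groups, key=key, reverse=True)[:sp_num]
```

Sorting on the tuple (length, support) in reverse keeps longer patterns ahead of shorter but more frequent ones, in line with the remark that a longer GSP is more confident.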

GSP as pseudo class label
- Based on GSPs, we propose a novel clustering algorithm that forms a pseudo class for the emails matching the same GSP group, and then uses a discriminative variant of the Classification Expectation Maximization (CEM) algorithm to get the final clusters.
- When CEM is applied to document clustering, the high dimensionality usually causes inaccurate model estimation and degrades efficiency. Here a linear SVM is used as the underlying classifier.

GSP as pseudo class label
- The proposed GSP-PCL algorithm uses GSP groups to construct the initial pseudo classes.
- Only the emails not matching any GSP group are classified.

GSP as pseudo class label
- A threshold is defined to control whether an email is put into a pseudo class.
- Only when the maximal posterior probability of an email is greater than the given threshold will the email be put into the class with the maximal posterior probability.
- Otherwise the email will be put into a special class D_other.

Algorithm: GSP-PCL
GSP-PCL(k, GSP groups G_1, G_2, ..., G_sp_num, email set D)
1. Construct sp_num pseudo classes using the GSP groups: D_i^0 = {d | d ∈ D and d matches G_i}, i = 1, 2, ..., sp_num.
2. D' = D - ∪_{i=1..sp_num} D_i^0
3. Iterate until convergence. For the j-th iteration, j > 0:
   a) Train an SVM classifier based on D_i^(j-1), i = 1, ..., sp_num.
   b) For each d ∈ D', classify d into D_i^(j-1) if P(D_i^(j-1) | d) is the maximal posterior probability and P(D_i^(j-1) | d) > min_class_prob.
4. D_other = D - ∪_{i=1..sp_num} D_i^j
5. Use basic k-means to partition D_other into (k - sp_num) clusters.
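The control flow of steps 1 to 4 can be sketched as below. This is not the authors' implementation: the linear SVM is replaced by a deliberately simple term-overlap classifier (train_overlap) so that the example runs with the standard library only, and its normalized scores merely stand in for posterior probabilities. Treating "d matches G_i" as "the generalized subject of d contains some GSP of G_i", letting the first matching group win, and capping the loop with max_iter are further assumptions. Step 5 (k-means on D_other) is left out.

```python
from collections import Counter

def train_overlap(labeled):
    """Stand-in classifier (NOT an SVM): scores each class by term overlap
    with the class vocabulary and normalizes the scores to sum to 1."""
    vocab = {}
    for doc, label in labeled:
        vocab.setdefault(label, Counter()).update(doc)

    def predict(doc):
        scores = {c: sum(v[t] for t in doc) / max(1, sum(v.values()))
                  for c, v in vocab.items()}
        total = sum(scores.values())
        return {c: (s / total if total else 0.0) for c, s in scores.items()}

    return predict

def gsp_pcl(docs, subjects, groups, min_class_prob,
            train=train_overlap, max_iter=20):
    """docs[i]: term set of email i; subjects[i]: its generalized subject."""
    n = len(docs)
    # Step 1: seed the pseudo classes with emails matching a GSP group.
    seed = {}
    for i, subject in enumerate(subjects):
        for g, group in enumerate(groups):
            if any(p <= subject for p in group):
                seed[i] = g  # assumption: first matching group wins
                break
    # Step 2: D' holds the emails that match no GSP group.
    rest = [i for i in range(n) if i not in seed]
    assign = dict(seed)
    # Step 3: retrain and reclassify D' until the assignment is stable.
    for _ in range(max_iter):
        predict = train([(docs[i], c) for i, c in assign.items()])
        new_assign = dict(seed)
        for i in rest:
            probs = predict(docs[i])
            if not probs:
                continue
            best = max(probs, key=probs.get)
            if probs[best] > min_class_prob:
                new_assign[i] = best
        if new_assign == assign:
            break
        assign = new_assign
    # Step 4: everything still unassigned forms D_other.
    other = [i for i in range(n) if i not in assign]
    return assign, other
```

Any function with the same train/predict shape can be passed as `train`, for example a wrapper around a linear SVM with calibrated probabilities.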

Experiments
1. Enron email dataset: an email archive from many of the senior management of Enron Co.
2. Private email dataset

Experiments
- Let C = {C_1, ..., C_m} be a set of clusters generated by a clustering algorithm on a data set X, and B = {B_1, ..., B_n} be a set of predefined classes on X.
- Each pair of objects (X_i, X_j) from the data set X belongs to one of the four possible cases:

Experiments
- Precision, recall, and F1-measure are calculated as follows:
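The formulas themselves did not survive in this transcript. As an illustration only, the sketch below uses the standard pairwise definitions, which are assumed rather than confirmed to be the ones on the slide: a pair counts as SS if both objects share a cluster and a class, SD if they share a cluster but not a class, and DS if they share a class but not a cluster. Then precision = SS / (SS + SD), recall = SS / (SS + DS), and F1 is their harmonic mean.

```python
from itertools import combinations

def pairwise_prf(clusters, classes):
    """clusters[i] and classes[i] are the cluster id and the predefined
    class id of object i. Returns (precision, recall, f1)."""
    ss = sd = ds = 0
    for i, j in combinations(range(len(clusters)), 2):
        same_cluster = clusters[i] == clusters[j]
        same_class = classes[i] == classes[j]
        if same_cluster and same_class:
            ss += 1
        elif same_cluster:
            sd += 1
        elif same_class:
            ds += 1
    precision = ss / (ss + sd) if ss + sd else 0.0
    recall = ss / (ss + ds) if ss + ds else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

Pairs that differ in both cluster and class (the fourth case) do not enter any of the three measures.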

Experiments

Cluster Naming
- In GSP-PCL, cluster names are generated as follows:
  - If the emails in a cluster match one or more GSP groups, the cluster is named by the GSP with the highest support.
  - Otherwise it is named by the top five words, sorted by the scores computed as follows, where:
    - C_j denotes the cluster
    - d_i is an email
    - t_k is a word
    - t_ki is the weight of word t_k in the email d_i

Cluster Naming

Conclusions
- In this paper, we proposed a novel approach that automatically extracts embedded knowledge from email subjects to help improve email clustering.
- Natural language processing techniques and the frequent closed itemset mining technique are employed to generate generalized sentence patterns (GSPs) from email subjects, which can be used to assist clustering as well as serve as good cluster descriptors.