Adding Semantics to Email Clustering
Hua Li, Dou Shen, Benyu Zhang, Zheng Chen, Qiang Yang
Microsoft Research Asia, Beijing, P.R. China
Department of Computer Science and Engineering, Hong Kong University of Science and Technology
(ICDM 2006)
Date: 2008/10/16
Speaker: Cho Chin Wei
Advisor: Dr. Koh, JiaLing
Abstract
This paper presents a novel algorithm that clusters emails according to their contents and the sentence styles of their subject lines. The approach improves clustering performance and provides meaningful descriptions of the resulting clusters.
Introduction
Emails have become an important medium of communication. A user may receive tens or even hundreds of emails every day, and handling them takes much time. Automatic approaches are needed to relieve the burden of processing these emails.
Introduction
Supervised methods need a predefined taxonomy and require users to update that taxonomy manually; preparing training data is time-consuming and expensive. An unsupervised technique such as clustering is an attractive alternative.
Introduction
Conventionally, clustering is based on a bag-of-words representation. This simplistic approach cannot take full advantage of the valuable linguistic features inherent in semi-structured emails, which may result in unsatisfactory performance.
Method
This paper presents a novel technique that clusters emails according to sentence patterns discovered from their subject lines. Each subject line is treated as a sentence and parsed with natural language processing techniques.
Method
Meaningful generalized sentence patterns (GSPs) are generated automatically from email subjects using natural language processing and frequent-itemset mining techniques. A novel unsupervised approach then treats the GSPs as pseudo class labels and conducts clustering in a supervised manner, with no human labeling involved.
What is a GSP?
A GSP (Generalized Sentence Pattern) summarizes the subjects of a large number of similar emails, yielding a semantic representation of the subject lines. Example: the GSP {"person", "seminar", "date"} means that someone ("person") gives a "seminar" on some day ("date").
Generalization of Terms - NLPWin
Microsoft's NLPWin tool, a natural language parser, takes a sentence as input and builds a syntactic tree for it.
Generalization of Terms - NLPWin
NLPWin can also generate factoids for noun phrases, e.g. person names, dates, and places.
Mine GSP
For each subject:
1. Remove stop words.
2. Use NLPWin to generate its syntactic tree.
3. Add the factoids of the tree nodes into the subject.
The resulting subjects are called generalized sentences, and any subset of a GSP is still a GSP. Example: {welcome, Bruce, Lee, person} and {welcome, Bob, Brill, person} are generalized sentences.
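Since NLPWin is proprietary, the generalization step can be sketched with a toy factoid lexicon. The `FACTOIDS` and `STOP_WORDS` tables below are hypothetical stand-ins for the parser's output, not part of the paper:

```python
# Toy stand-in for NLPWin's factoid recognition: maps known terms to factoids.
FACTOIDS = {"bruce": "person", "lee": "person",
            "bob": "person", "brill": "person", "monday": "date"}
STOP_WORDS = {"the", "a", "to", "on"}

def generalize(subject):
    """Lowercase the subject, drop stop words, and add a factoid tag for
    each recognized term, yielding a generalized sentence (a set of terms)."""
    terms = [w for w in subject.lower().split() if w not in STOP_WORDS]
    tags = {FACTOIDS[w] for w in terms if w in FACTOIDS}
    return frozenset(terms) | tags
```

On the slide's example, `generalize("Welcome Bruce Lee")` yields the generalized sentence {welcome, bruce, lee, person}.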
Mine GSP
Definition (Generalized Sentence Pattern, GSP): given a set of generalized sentences S = {s_1, s_2, ..., s_n} and a generalized sentence p, the support of p in S is
sup(p) = |{s_i | s_i ∈ S and p ⊆ s_i}|.
Given a minimum support min_sup, if sup(p) > min_sup, then p is called a generalized sentence pattern, or GSP for short. Examples: {welcome, person} and {interview, date}. Only closed GSPs are mined in the experiments.
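A minimal sketch of this definition, using naive subset enumeration in place of a real closed-itemset miner (the sample sentences and min_sup value are illustrative):

```python
from itertools import combinations

def support(pattern, sentences):
    """sup(p): number of generalized sentences containing every term of p."""
    return sum(1 for s in sentences if pattern <= s)

def mine_gsps(sentences, min_sup):
    """Enumerate every term subset with sup(p) > min_sup. This brute force
    is exponential; the paper mines only closed patterns with an efficient
    frequent-closed-itemset algorithm instead."""
    gsps = {}
    for s in sentences:
        for r in range(1, len(s) + 1):
            for combo in combinations(sorted(s), r):
                p = frozenset(combo)
                if p not in gsps:
                    sup = support(p, sentences)
                    if sup > min_sup:
                        gsps[p] = sup
    return gsps

sentences = [frozenset({"welcome", "person"}),
             frozenset({"welcome", "person", "seminar"}),
             frozenset({"interview", "date"}),
             frozenset({"interview", "date", "person"})]
patterns = mine_gsps(sentences, min_sup=1)
```

Here {welcome, person} and {interview, date} each appear in two sentences, so both qualify as GSPs, while {welcome, person, seminar} (support 1) does not.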
GSPs grouping and selection
Similar GSPs are grouped together to reduce their number; GSPs in the same group will represent the same cluster. The similarity between two GSPs p and q is defined by their subset-superset relationship and support:
Sim(p, q) = 1 if p ⊃ q and sup(p)/sup(q) > min_conf; otherwise Sim(p, q) = 0.
Experiments show min_conf works best between 0.5 and 0.8: too low a value groups unrelated GSPs together and introduces noise, while too high a value fails to group similar GSPs.
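The similarity rule can be sketched directly; the `sup` dictionary and min_conf value below are illustrative:

```python
def gsp_sim(p, q, sup, min_conf=0.6):
    """Sim(p, q) = 1 when p is a proper superset of q and sup(p)/sup(q)
    exceeds min_conf; otherwise 0."""
    if q < p and sup[p] / sup[q] > min_conf:
        return 1
    return 0

sup = {frozenset({"welcome"}): 10,
       frozenset({"welcome", "person"}): 8}
# {welcome, person} ⊃ {welcome} and 8/10 = 0.8 > 0.6, so the two are similar
# and end up in the same GSP group.
```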
GSPs grouping and selection
The number of GSP groups can be several times larger than the actual number of clusters, so representative GSP groups are selected with heuristic rules: sort the GSP groups in descending order of length, then in descending order of support, and select the first sp_num groups for clustering. The parameter sp_num controls how many GSP groups are selected. A longer GSP is considered more confident than a shorter one.
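A sketch of this selection heuristic, assuming "sort by length, then by support" means support breaks ties among equal-length groups (the sample groups are hypothetical):

```python
def select_gsp_groups(groups, sp_num):
    """Rank (pattern, support) pairs by pattern length, then support, both
    descending (a longer GSP is considered more confident), keep top sp_num."""
    ranked = sorted(groups, key=lambda g: (len(g[0]), g[1]), reverse=True)
    return ranked[:sp_num]

groups = [(frozenset({"welcome", "person"}), 8),
          (frozenset({"interview", "date", "person"}), 5),
          (frozenset({"seminar"}), 12)]
top = select_gsp_groups(groups, sp_num=2)
```

The length-3 interview group ranks first despite its lower support, reflecting the longer-is-more-confident rule.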
GSP as pseudo class label
Based on GSPs, a novel clustering algorithm forms a pseudo class from the emails matching the same GSP group and then uses a discriminative variant of the Classification Expectation Maximization (CEM) algorithm to obtain the final clusters. When CEM is applied to document clustering, the high dimensionality usually makes model estimation inaccurate and degrades efficiency, so a linear SVM is used as the underlying classifier.
GSP as pseudo class label
The proposed GSP-PCL algorithm uses GSP groups to construct the initial pseudo classes; only the emails not matching any GSP group need to be classified.
GSP as pseudo class label
A threshold controls whether an email is put into a pseudo class: only when the maximal posterior probability of an email exceeds the threshold is the email put into the class with that maximal posterior probability; otherwise the email is put into a special class D_other.
Algorithm: GSP-PCL
GSP-PCL(k, GSP groups G_1, G_2, ..., G_sp_num, email set D)
1. Construct sp_num pseudo classes from the GSP groups: D_i^0 = {d | d ∈ D and d matches G_i}, i = 1, 2, ..., sp_num.
2. D' = D − ∪_{i=1}^{sp_num} D_i^0.
3. Iterate until convergence. For the j-th iteration, j > 0:
   a) Train an SVM classifier on D_i^{j−1}, i = 1, ..., sp_num.
   b) For each d ∈ D', classify d into D_i^j if P(D_i^{j−1} | d) is the maximal posterior probability and P(D_i^{j−1} | d) > min_class_prob.
4. D_other = D − ∪_{i=1}^{sp_num} D_i^j.
5. Partition D_other into (k − sp_num) clusters with basic k-means.
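A rough sketch of steps 1-4. To stay self-contained, a nearest-centroid classifier over bag-of-words vectors stands in for the paper's linear SVM, and normalized cosine similarity stands in for the posterior probability; step 5 (k-means on D_other) is omitted:

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse bag-of-words dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(bags):
    """Mean bag-of-words vector of a pseudo class."""
    c = {}
    for bag in bags:
        for t, w in bag.items():
            c[t] = c.get(t, 0.0) + w
    return {t: w / len(bags) for t, w in c.items()} if bags else {}

def gsp_pcl(docs, gsp_groups, min_class_prob=0.4, iters=3):
    """docs: list of (bag_of_words, generalized_term_set) pairs.
    Steps 1-2: seed pseudo classes with emails matching each GSP group.
    Step 3: repeatedly classify the rest, keeping only confident assignments.
    Step 4: the leftovers form D_other."""
    n = len(gsp_groups)
    seeds = [[i for i, d in enumerate(docs) if g <= d[1]] for g in gsp_groups]
    rest = [i for i, d in enumerate(docs)
            if all(not (g <= d[1]) for g in gsp_groups)]
    assign = {}
    for _ in range(iters):
        members = [list(s) for s in seeds]
        for i, c in assign.items():
            members[c].append(i)
        cents = [centroid([docs[i][0] for i in m]) for m in members]
        assign = {}
        for i in rest:
            sims = [cosine(docs[i][0], c) for c in cents]
            total = sum(sims)
            if total > 0:
                probs = [s / total for s in sims]  # stand-in "posterior"
                best = max(range(n), key=probs.__getitem__)
                if probs[best] > min_class_prob:
                    assign[i] = best
    other = [i for i in rest if i not in assign]  # D_other, for k-means
    return assign, other

gsp_groups = [frozenset({"seminar"}), frozenset({"interview"})]
docs = [({"seminar": 1.0, "ai": 1.0}, frozenset({"seminar", "person"})),
        ({"interview": 1.0, "date": 1.0}, frozenset({"interview", "date"})),
        ({"seminar": 1.0, "talk": 1.0}, frozenset({"talk", "person"})),
        ({"party": 1.0}, frozenset({"party"}))]
assign, other = gsp_pcl(docs, gsp_groups)
```

Document 2 matches no GSP group but is confidently pulled into the "seminar" pseudo class by content, while document 3 stays in D_other.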
Experiments
Two datasets are used:
1. Enron dataset: an email archive from many of the senior managers of Enron Co.
2. A private email dataset.
Experiments
Let C = {C_1, ..., C_m} be a set of clusters generated by a clustering algorithm on a data set X, and B = {B_1, ..., B_n} be a set of predefined classes on X. Each pair of objects (X_i, X_j) from X falls into one of four cases: same cluster and same class, same cluster but different classes, different clusters but same class, or different clusters and different classes.
Experiments
Precision, recall, and the F1 measure are computed over these pairs: precision is the fraction of same-cluster pairs that also share a class, recall is the fraction of same-class pairs that also share a cluster, and F1 = 2 · precision · recall / (precision + recall).
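The pair-counting measures can be sketched as follows, with a, b, c counting the same-cluster/same-class, same-cluster/different-class, and different-cluster/same-class pairs respectively (the sample labelings are illustrative):

```python
from itertools import combinations

def pairwise_prf(cluster_labels, class_labels):
    """Pair-counting precision/recall/F1 for a clustering against true classes."""
    a = b = c = 0
    for i, j in combinations(range(len(cluster_labels)), 2):
        same_cluster = cluster_labels[i] == cluster_labels[j]
        same_class = class_labels[i] == class_labels[j]
        if same_cluster and same_class:
            a += 1              # pair correctly placed together
        elif same_cluster:
            b += 1              # pair wrongly placed together
        elif same_class:
            c += 1              # pair wrongly split apart
    p = a / (a + b) if a + b else 0.0
    r = a / (a + c) if a + c else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

p, r, f1 = pairwise_prf([0, 0, 1, 1], [0, 0, 0, 1])
```

For this toy labeling, a = 1, b = 1, c = 2, giving precision 0.5, recall 1/3, and F1 0.4.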
Cluster Naming
In GSP-PCL, cluster names are generated as follows: if the emails in a cluster match one or more GSP groups, the cluster is named by the GSP with the highest support; otherwise it is named by the top five words ranked by a score computed from the word weights, where C_j denotes the cluster, d_i an email, t_k a word, and t_ki the weight of word t_k in email d_i.
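The slide does not reproduce the paper's exact word-scoring formula, so the sketch below uses a simple stand-in score: the summed weight t_ki of each word t_k over the emails in the cluster C_j:

```python
def name_cluster(cluster_bags, matched_gsp=None, top_n=5):
    """Name a cluster by its matched GSP when one exists, otherwise by the
    top_n words under a stand-in score (total weight across the cluster's
    emails); the paper's exact formula is not shown on this slide."""
    if matched_gsp:
        return " ".join(sorted(matched_gsp))
    score = {}
    for bag in cluster_bags:
        for t, w in bag.items():
            score[t] = score.get(t, 0.0) + w
    return " ".join(sorted(score, key=score.get, reverse=True)[:top_n])
```

For example, a cluster matching the GSP {person, seminar} is named from that GSP, while an unmatched cluster falls back to its highest-scoring words.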
Conclusions
This paper proposed a novel approach that automatically extracts embedded knowledge from email subjects to help improve clustering. Natural language processing and frequent closed itemset mining techniques are employed to generate generalized sentence patterns (GSPs for short) from the subjects; the GSPs both assist clustering and serve as good cluster descriptors.