Slide 1: Distributional Clustering of Words for Text Classification
Authors: L. Douglas Baker, Andrew Kachites McCallum
Presenter: Yihong Ding
Slide 2: Text Classification
What is it?
- Categorize documents into specialized classes
- Class label == target concept
Why does it matter?
- Exponentially increasing number of web documents
- Upstream work for many other important topics (beyond text classification itself)
- Document identification for information extraction (project 2)
- …
Slide 3: Distributional Clustering
Benefits:
- Useful semantic word clusters
- Higher classification accuracy
- Smaller classification models
The whole solution: distributional clustering embedded in a Naïve Bayes classifier.
Slide 4: Two Assumptions
One-to-one assumption
- Content: mixture model components correspond one-to-one with the target classes
- Reality: independent target classes
Naïve Bayes assumption
- Content: word probabilities are equal throughout one text
- Reality: each word event is treated as independent of context and position
Slide 5: Naïve Bayes Framework
- Training document set D = {d_1, d_2, …, d_n}
- Target class set C = {c_1, c_2, …, c_m}
- Mixture (parametric) model: each component is parameterized by θ; the estimate of θ learned from the training documents is denoted θ̂
- Target classifier: the probability of each class given the evidence of the test document, P(c_j | d_i; θ), computed by Bayes' rule
Slide 6: Naïve Bayes Framework (cont.)
- Probability of each document given the mixture model (1-to-1 assumption):
  P(d_i | θ) = Σ_j P(c_j | θ) P(d_i | c_j; θ)
- Bayes-optimal classifier:
  P(c_j | d_i; θ) = P(c_j | θ) P(d_i | c_j; θ) / P(d_i | θ)
- Probability of a document given class c_j (Naïve Bayes assumption):
  P(d_i | c_j; θ) = Π_k P(w_{d_i,k} | c_j; θ), where w_{d_i,k} is the k-th word of d_i
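As an illustration of the framework above, here is a minimal Python sketch (not the authors' implementation; the names posterior_over_classes, class_priors, and word_given_class are assumptions made here) of the class posterior under the 1-to-1 and Naïve Bayes assumptions, computed in log space to avoid underflow:

```python
import math
from collections import Counter

def posterior_over_classes(doc_tokens, class_priors, word_given_class):
    """Compute P(c_j | d_i; theta) by Bayes' rule under the Naive Bayes assumption.

    class_priors:     dict class -> P(c_j | theta)
    word_given_class: dict class -> dict word -> P(w_t | c_j; theta)
    """
    counts = Counter(doc_tokens)
    log_joint = {}
    for c, prior in class_priors.items():
        # log P(c_j | theta) + sum_t N(w_t, d_i) * log P(w_t | c_j; theta)
        lp = math.log(prior)
        for w, n in counts.items():
            lp += n * math.log(max(word_given_class[c].get(w, 0.0), 1e-12))
        log_joint[c] = lp
    # Normalize: P(c_j | d_i) = P(c_j) P(d_i | c_j) / sum_k P(c_k) P(d_i | c_k)
    m = max(log_joint.values())
    z = sum(math.exp(v - m) for v in log_joint.values())
    return {c: math.exp(v - m) / z for c, v in log_joint.items()}
```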
Slide 7: Naïve Bayes Framework (cont.)
Slide 8: Naïve Bayes Framework (cont.)
Transform the equation above:
- Uniform class prior: drop P(c_j | θ), which is constant over all classes
- Drop the denominator P(d_i | θ), also constant over all classes
- Rewrite the product over document positions as a product over the vocabulary, with word counts as exponents
- Take the log and divide by the document length |d_i|
- Compute the argmax over classes
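Collecting these steps into one derivation (a reconstruction from the standard Naïve Bayes algebra, not copied from the slides), with N(w_t, d_i) the count of word w_t in d_i and P(w_t | d_i) = N(w_t, d_i) / |d_i|:

```latex
\[
\operatorname*{argmax}_j \tfrac{1}{|d_i|} \sum_t N(w_t, d_i)\, \log P(w_t \mid c_j; \theta)
  \;=\; \operatorname*{argmax}_j \sum_t P(w_t \mid d_i)\, \log P(w_t \mid c_j; \theta)
\]
\[
  \;=\; \operatorname*{argmax}_j \Bigl[ -H\bigl(P(W \mid d_i)\bigr)
        - D\bigl(P(W \mid d_i) \,\|\, P(W \mid c_j; \theta)\bigr) \Bigr]
  \;=\; \operatorname*{argmin}_j D\bigl(P(W \mid d_i) \,\|\, P(W \mid c_j; \theta)\bigr),
\]
```

since the document entropy H(P(W | d_i)) does not depend on the class j. This is the argmin form that appears on the next slide.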
Slide 9: Naïve Bayes Framework (cont.)
- With the uniform class prior, the argmax of the length-normalized log posterior becomes an argmin of a KL divergence:
  argmax_j P(c_j | d_i; θ) = argmin_j D( P(W | d_i) || P(W | c_j; θ) )
- P(W | d_i): the distribution of words in the document
- P(W | c_j; θ): the distribution of words in the class
- Why not use a distribution over word clusters instead?!
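A small Python sketch of this KL-based decision rule (illustrative only; word_dist, kl, and classify_by_kl are names chosen here, and a uniform class prior is assumed):

```python
import math
from collections import Counter

def word_dist(tokens):
    """Empirical word distribution P(w_t | d_i) from a token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()}

def kl(p, q, eps=1e-12):
    """D( P(W|d_i) || P(W|c_j) ), summed over words that occur in the document."""
    return sum(pw * math.log(pw / max(q.get(w, 0.0), eps)) for w, pw in p.items())

def classify_by_kl(doc_tokens, class_word_dists):
    """Pick argmin_j D( P(W|d_i) || P(W|c_j) ), equivalent (under a uniform prior)
    to the length-normalized log-likelihood argmax from the previous slides."""
    p_doc = word_dist(doc_tokens)
    return min(class_word_dists, key=lambda c: kl(p_doc, class_word_dists[c]))
```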
Slide 10: Distributional Clustering Intuition
- P(C | w_t) expresses the distribution over the classes for word w_t
- Cluster words so as to preserve this distribution
Slide 11: Kullback-Leibler Divergence
- A measure of similarity between probability distributions
- Traditional KL divergence:
  D( P(C|w_t) || P(C|w_s) ) = Σ_j P(c_j|w_t) log( P(c_j|w_t) / P(c_j|w_s) )
- Shortcomings: not symmetric; may be infinite
- KL divergence to the mean:
  [ P(w_t) / (P(w_t)+P(w_s)) ] D( P(C|w_t) || P(C|w_t ∨ w_s) ) + [ P(w_s) / (P(w_t)+P(w_s)) ] D( P(C|w_s) || P(C|w_t ∨ w_s) ),
  where P(C|w_t ∨ w_s) is the word-prior-weighted mean of the two class distributions
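Both measures can be sketched directly from the definitions above (again an illustration; the helper names and the eps clipping are choices made here, not the paper's code):

```python
import math

def kl(p, q, eps=1e-12):
    """Traditional KL divergence D(p || q) over class labels (dicts class -> prob).
    Not symmetric, and unbounded when q assigns zero where p does not (clipped by eps)."""
    return sum(pc * math.log(pc / max(q.get(c, 0.0), eps)) for c, pc in p.items() if pc > 0)

def kl_to_the_mean(p_t, p_s, prior_t, prior_s):
    """KL divergence 'to the mean' of P(C|w_t) and P(C|w_s), weighted by the
    word marginals P(w_t) and P(w_s); symmetric and always finite."""
    z = prior_t + prior_s
    a, b = prior_t / z, prior_s / z
    classes = set(p_t) | set(p_s)
    mean = {c: a * p_t.get(c, 0.0) + b * p_s.get(c, 0.0) for c in classes}
    return a * kl(p_t, mean) + b * kl(p_s, mean)
```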
Slide 12: Clustering Algorithm
1. Sort the vocabulary by mutual information with the class variable
2. Initialize M clusters as singletons with the top M words
3. Loop until all words have been put into one of the M clusters:
   - Merge the two clusters that are most similar, resulting in M - 1 clusters
   - Create a new cluster consisting of the next word from the sorted list, restoring the number of clusters to M
The resulting clusters are used to compute P(c_j | d_i; θ) for each class and to assign the document to the most probable class. A sketch of the loop follows below.
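A greedy sketch of step 3, reusing the kl_to_the_mean helper from the previous block and assuming at least two clusters (M >= 2); the data-structure choices here are assumptions for illustration, not the authors' implementation:

```python
def distributional_clustering(sorted_vocab, class_dist, word_prior, M):
    """Greedy agglomerative sketch of the clustering loop on this slide.

    sorted_vocab: words sorted by mutual information with the class variable
    class_dist:   word -> P(C|w) as dict class -> prob
    word_prior:   word -> P(w)
    Returns a list of M clusters, each a list of words.
    """
    # Each cluster keeps its member words, combined P(C|cluster), and total mass P(cluster).
    clusters = [([w], dict(class_dist[w]), word_prior[w]) for w in sorted_vocab[:M]]
    for w in sorted_vocab[M:]:
        # 1. Merge the two most similar clusters (smallest KL divergence to the mean).
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: kl_to_the_mean(clusters[ab[0]][1], clusters[ab[1]][1],
                                          clusters[ab[0]][2], clusters[ab[1]][2]),
        )
        words_i, dist_i, mass_i = clusters[i]
        words_j, dist_j, mass_j = clusters[j]
        z = mass_i + mass_j
        merged_dist = {c: (mass_i * dist_i.get(c, 0.0) + mass_j * dist_j.get(c, 0.0)) / z
                       for c in set(dist_i) | set(dist_j)}
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append((words_i + words_j, merged_dist, z))
        # 2. Add the next word from the sorted list as a new singleton cluster,
        #    restoring the number of clusters to M.
        clusters.append(([w], dict(class_dist[w]), word_prior[w]))
    return [words for words, _, _ in clusters]
```

Each merge forms the word-prior-weighted mean of the two class distributions, which is how merging (rather than discarding) preserves the information carried by infrequent words.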
Slide 13: Experimental Results: 20 Newsgroups
- 20,000 articles evenly divided among 20 newsgroups; vocabulary: 62,258 words
- Accuracy with 50 features:
  - Distributional Clustering: 82.1%
  - LSI: 60%
  - Mutual Information: 46.3%
  - Class-based Clustering: 14.5%
  - Markov-blanket feature selector: ~60%
- Why DC beats feature selection: infrequent features may be important when they do occur, and merging preserves their information
Slide 14: Experimental Results: Reuters-21578 and Yahoo! Data Sets
- Reuters-21578 data set
  - 90 of the 135 topic categories; vocabulary: 16,177 words
  - DC outperforms the other methods at small feature-set sizes
- Yahoo! data set
  - 6,294 web pages in 41 classes; vocabulary: 44,383 words
  - Naïve Bayes with 500 words achieves 66.4%, the highest result; the training data are too noisy
Slide 15: Conclusion
- DC aggressively reduces the number of features while maintaining high classification accuracy
- At small feature-set sizes, DC outperforms:
  - supervised Latent Semantic Indexing
  - class-based clustering
  - feature selection by mutual information
  - feature selection by a Markov-blanket method
- DC may not overcome the sparse-data problem: it is strongly biased toward preserving the poor initial estimates of P(C|w_t)
Slide 16: Mixture Model
Diagram: classes c_1, c_2, c_3, …, c_n and documents d_1, d_2, d_3, …, d_m, with a function F_1 mapping (d_1, d_2, d_3, …, d_m) to (c_1, c_2, c_3, …, c_n).
Slide 17: Mixture Model
Diagram: the same classes and documents, with the mapping between them unknown (marked "?").
Slide 18: Mixture Model
Diagram: mixture components θ_1, θ_2, θ_3, …, θ_n and documents d_1, d_2, d_3, …, d_m, with a function F_2 over the components and their combinations mapping to (d_1, d_2, d_3, …, d_m); the components correspond 1-to-1 with the classes C.