
1 Distributional Clustering of Words for Text Classification Presentation by: Thomas Walsh (Rutgers University). Paper by: L. Douglas Baker (Carnegie Mellon University) and Andrew Kachites McCallum (Justsystem Pittsburgh Research Center)

2 Clustering Define what it means for words to be "similar". "Collapse" the word space by grouping similar words into "clusters". Key idea for Distributional Clustering: –The class probabilities given the words in a labeled document collection, P(C|w), provide the rules for correlating words with classifications.

3 Voting Classification can be understood through a voting model: each word in a document casts a weighted vote for a classification. Words that normally vote similarly can be clustered together and vote with the average of their weighted votes without significantly hurting performance.

4 Benefits of Word Clustering Useful semantic word clusters –Automatically generates a "thesaurus" Higher classification accuracy –Sort of; we'll discuss this in the results section Smaller classification models –Size reductions as dramatic as 50,000 → 50 features

5 Benefits of Smaller Models Easier to compute –with the constantly increasing amount of available text, reducing the memory footprint is crucial. Memory-constrained devices like PDAs could now use text classification algorithms to organize documents. More complex algorithms that would be infeasible in 50,000 dimensions become practical.

6 The Framework Start with training data consisting of: –A set of classes C = {c_1, c_2, …, c_m} –A set of documents D = {d_1, …, d_n} –Each document has a class label

7 Mixture Models f(x_i | θ) = Σ_k p_k h(x_i | φ_k) The mixture weights p_k sum to 1. h is a distribution function for x (such as a Gaussian) with φ_k as its parameters ((μ, σ) in the Gaussian case). Thus θ = (p_1 … p_K, φ_1 … φ_K).
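A minimal Python sketch of evaluating such a mixture density, assuming Gaussian components; the function names and the two-component example values are illustrative, not from the slides:

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Component density h(x | phi_k) with phi_k = (mu, sigma)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def mixture_density(x, weights, params):
    """f(x | theta) = sum_k p_k * h(x | phi_k); the mixture weights p_k must sum to 1."""
    assert abs(sum(weights) - 1.0) < 1e-9
    return sum(p * gaussian_pdf(x, mu, sigma) for p, (mu, sigma) in zip(weights, params))

# Two-component example: theta = (p_1, p_2, phi_1, phi_2)
weights = [0.3, 0.7]
params = [(0.0, 1.0), (5.0, 2.0)]
print(mixture_density(1.0, weights, params))
```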

8 What is θ in this case? Assumption: there is a one-to-one correspondence between the mixture-model components and the classes. The class priors P(c_j | θ) are part of θ and are estimated as the number of training documents in each class divided by the total number of documents.

9 What is θ in this case? The rest of the entries in θ correspond to disjoint sets, one per class. The j-th set contains the probability of each word w_t in the vocabulary V given the class c_j, i.e. P(w_t | c_j; θ). N(w_t, d_i) is the number of times word w_t appears in document d_i, and P(c_j | d_i) ∈ {0, 1} because each training document carries exactly one class label.
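A minimal sketch of estimating these parameters from labeled training documents. The add-one (Laplace) smoothing and the toy documents are assumptions made for the sketch, not taken from the slides:

```python
from collections import Counter

def estimate_parameters(docs, labels, vocab):
    """Estimate theta from labeled training data.

    docs   : list of token lists
    labels : list of class labels, one per document (so P(c_j | d_i) is 0 or 1)
    vocab  : set of all words V
    Returns (priors, word_probs) where word_probs[c][w] = P(w | c).
    Add-one (Laplace) smoothing is an assumption here, a standard choice
    for words that never occur in a class.
    """
    classes = set(labels)
    # Class priors: documents in class / total documents
    priors = {c: labels.count(c) / len(labels) for c in classes}

    # Word counts N(w_t, d_i) summed over the documents of each class
    counts = {c: Counter() for c in classes}
    for tokens, c in zip(docs, labels):
        counts[c].update(tokens)

    word_probs = {}
    for c in classes:
        total = sum(counts[c].values())
        word_probs[c] = {w: (1 + counts[c][w]) / (len(vocab) + total) for w in vocab}
    return priors, word_probs

docs = [["ball", "goal", "team"], ["vote", "senate"], ["team", "vote", "goal"]]
labels = ["sports", "politics", "sports"]
vocab = {w for d in docs for w in d}
priors, word_probs = estimate_parameters(docs, labels, vocab)
print(priors["sports"], word_probs["sports"]["goal"])
```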

10 Probability of a Given Document under the Model The mixture model generates document d_i with probability P(d_i | θ) = Σ_j P(c_j | θ) P(d_i | c_j; θ): just the sum, over each class, of the probability of generating this document from that class.

11 Documents as Collections of Words Treat each document as an ordered collection of word events. d_ik = the word in document d_i at position k. In general, each word is dependent on the preceding words.

12 Apply the Naïve Bayes Assumption Assume each word is independent of both context and position, where d_ik = w_t. Update formulas (2) and (1): –(2) P(d_i | c_j; θ) = Π_k P(w_{d_ik} | c_j; θ) –(1) P(d_i | θ) = Σ_j P(c_j | θ) Π_k P(w_{d_ik} | c_j; θ)
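A small Python sketch of both formulas, computed in log space to avoid underflow on long documents; the toy parameter values are illustrative assumptions:

```python
import math

def log_doc_given_class(tokens, word_probs_c):
    """Formula (2): log P(d_i | c_j; theta) = sum over word occurrences of log P(w | c_j)."""
    return sum(math.log(word_probs_c[w]) for w in tokens)

def log_doc_marginal(tokens, priors, word_probs):
    """Formula (1): P(d_i | theta) = sum_j P(c_j | theta) * prod_k P(w_{d_ik} | c_j), in logs."""
    terms = [math.log(priors[c]) + log_doc_given_class(tokens, word_probs[c]) for c in priors]
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))  # log-sum-exp for stability

# Toy parameters (illustrative values only)
priors = {"sports": 0.5, "politics": 0.5}
word_probs = {
    "sports":   {"goal": 0.4, "vote": 0.1, "team": 0.5},
    "politics": {"goal": 0.1, "vote": 0.6, "team": 0.3},
}
print(log_doc_marginal(["goal", "team", "goal"], priors, word_probs))
```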

13 Incorporate the Expanded Formulae for θ We can calculate the model parameters θ from the training data. Now we wish to calculate P(c_j | d_i; θ), the probability that document d_i belongs to class c_j.

14 Final Equation P(c_j | d_i; θ) = P(c_j | θ) Π_k P(w_{d_ik} | c_j; θ) / Σ_r P(c_r | θ) Π_k P(w_{d_ik} | c_r; θ) –Numerator: the class prior times (2), the product of the probabilities of each word in the document given class c_j. –Denominator: the sum over all classes c_r of the class prior times the product of the word probabilities given c_r. The class c_j that maximizes this value is chosen as the document's class.
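A minimal classifier sketch built on this equation. Because the denominator is the same for every class, it is dropped before taking the argmax; the toy parameters are again assumptions:

```python
import math

def classify(tokens, priors, word_probs):
    """P(c_j | d_i; theta) is proportional to P(c_j | theta) * prod_k P(w_{d_ik} | c_j; theta).

    The shared denominator (the sum over all classes c_r) does not change the
    argmax, so it is dropped; logs avoid underflow on long documents.
    """
    scores = {
        c: math.log(priors[c]) + sum(math.log(word_probs[c][w]) for w in tokens)
        for c in priors
    }
    return max(scores, key=scores.get)

priors = {"sports": 0.5, "politics": 0.5}
word_probs = {
    "sports":   {"goal": 0.4, "vote": 0.1, "team": 0.5},
    "politics": {"goal": 0.1, "vote": 0.6, "team": 0.3},
}
print(classify(["vote", "vote", "team"], priors, word_probs))  # -> "politics"
```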

15 Shortcomings of the Framework In real-world data (documents) there is no true underlying mixture model, and the independence assumption does not actually hold. But empirical evidence and some theoretical work (Domingos and Pazzani 1997) indicate that the damage from this is negligible.

16 What about clustering? So assuming the Framework holds… how does clustering fit into all this?

17 How Does Clustering Affect the Probabilities? The class distribution of a merged cluster is a weighted average of its members' distributions: P(C | w_t ∨ w_s) = [P(w_t) / (P(w_t) + P(w_s))] P(C | w_t) + [P(w_s) / (P(w_t) + P(w_s))] P(C | w_s), i.e. the fraction of the cluster's occurrences contributed by w_t times its distribution, plus the fraction contributed by w_s times its distribution.
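A short sketch of that weighted-average merge; the example distributions and word probabilities are made up for illustration:

```python
def merge_distributions(p_c_given_wt, p_c_given_ws, p_wt, p_ws):
    """Class distribution of the merged cluster: a weighted average of the two
    word distributions, each weighted by that word's share of the cluster's mass."""
    total = p_wt + p_ws
    return {
        c: (p_wt / total) * p_c_given_wt[c] + (p_ws / total) * p_c_given_ws[c]
        for c in p_c_given_wt
    }

p_c_given_wt = {"sports": 0.8, "politics": 0.2}  # P(C | w_t)
p_c_given_ws = {"sports": 0.6, "politics": 0.4}  # P(C | w_s)
print(merge_distributions(p_c_given_wt, p_c_given_ws, p_wt=0.03, p_ws=0.01))
```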

18 Vs. Other Forms of Learning Similarity is measured on the very property the model is trying to estimate (the classes) –This makes the supervision in the training data especially important. Clustering is based on the similarity of the class-variable distributions. Key idea: clustering preserves the "shape" of the class distributions.

19 (figure-only slide: no transcript text)

20 Kullback-Leibler Divergence Measures the similarity between class distributions: D(P(C | w_t) || P(C | w_s)) = Σ_j P(c_j | w_t) log(P(c_j | w_t) / P(c_j | w_s)). If P(c_j | w_t) = P(c_j | w_s) for every class, each term contains log(1) = 0, so the divergence is zero.
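A direct sketch of this divergence between two word class distributions; it also previews the asymmetry discussed on the next slide. The example distributions are assumptions:

```python
import math

def kl_divergence(p, q):
    """D( P(C|w_t) || P(C|w_s) ) = sum_j P(c_j|w_t) * log( P(c_j|w_t) / P(c_j|w_s) ).

    Identical distributions give log(1) = 0 in every term, so the divergence is 0.
    """
    return sum(p[c] * math.log(p[c] / q[c]) for c in p if p[c] > 0)

p = {"sports": 0.8, "politics": 0.2}
q = {"sports": 0.6, "politics": 0.4}
print(kl_divergence(p, q), kl_divergence(q, p))  # not symmetric
print(kl_divergence(p, p))                       # 0.0
```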

21 Problems with K-L Divergence Not symmetric. The denominator P(c_j | w_s) can be 0 if w_s does not appear in any documents of class c_j, leaving the divergence undefined.

22 K-L Divergence from the Mean Each word contributes the ratio of its occurrences within the cluster times its K-L divergence to the cluster's mean distribution. New and improved: it uses a weighted average rather than a simple mean. Justification: this fits clustering, because the independent word distributions are combined into the cluster's statistics.
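A sketch of this weighted K-L divergence to the mean of a candidate cluster; the weights and distributions are illustrative. Note that the mean distribution is nonzero wherever either word's distribution is, which avoids the zero-denominator problem of plain K-L divergence:

```python
import math

def kl(p, q):
    return sum(p[c] * math.log(p[c] / q[c]) for c in p if p[c] > 0)

def kl_to_the_mean(p_wt, p_ws, w_t_weight, w_s_weight):
    """Weighted K-L divergence of each word's class distribution from the merged
    (mean) distribution of the candidate cluster. The weights are each word's
    share of the cluster's occurrences, so rare words count for less."""
    total = w_t_weight + w_s_weight
    a, b = w_t_weight / total, w_s_weight / total
    mean = {c: a * p_wt[c] + b * p_ws[c] for c in p_wt}
    return a * kl(p_wt, mean) + b * kl(p_ws, mean)

p_wt = {"sports": 0.8, "politics": 0.2}
p_ws = {"sports": 0.6, "politics": 0.4}
print(kl_to_the_mean(p_wt, p_ws, w_t_weight=0.03, w_s_weight=0.01))
```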

23 Minimizing Error in Naïve Bayes Scores Assuming uniform class priors allows us to drop P(c_j | θ) and the whole denominator from (6). A little algebra then yields the cross entropy, so the error introduced by clustering can be measured as the resulting change in cross entropy. Minimizing that change gives equation (9), so clustering by this criterion minimizes the error in the Naïve Bayes scores.
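A small numeric check of that claim, assuming made-up class distributions: merging two words with similar distributions barely increases the cross entropy, while merging dissimilar words increases it substantially (the increase equals the weighted K-L divergence to the mean):

```python
import math

def cross_entropy(p, q):
    """H(P, Q) = - sum_j P(c_j) * log Q(c_j)."""
    return -sum(p[c] * math.log(q[c]) for c in p if p[c] > 0)

def merged(p, q, a, b):
    t = a + b
    return {c: (a / t) * p[c] + (b / t) * q[c] for c in p}

similar    = ({"sports": 0.80, "politics": 0.20}, {"sports": 0.78, "politics": 0.22})
dissimilar = ({"sports": 0.80, "politics": 0.20}, {"sports": 0.10, "politics": 0.90})

for p, q in (similar, dissimilar):
    m = merged(p, q, 0.5, 0.5)
    # Increase in cross entropy caused by replacing each word's own distribution
    # with the cluster's distribution: small for similar words, large otherwise.
    increase = 0.5 * (cross_entropy(p, m) - cross_entropy(p, p)) \
             + 0.5 * (cross_entropy(q, m) - cross_entropy(q, q))
    print(round(increase, 4))
```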

24 The Clustering Algorithm Comparing the similarity of all possible word clusters would be O(V^2). Instead, a number M is fixed in advance as the total number of desired clusters –more supervision. The M clusters are initialized with the M words that have the highest mutual information with the class variable. Properties: greedy, scales efficiently.

25 Algorithm (figure: the clustering procedure over the word distributions P(C | w_t))
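A hedged sketch of a greedy procedure in the spirit of the slides: keep M clusters, add one word at a time, and merge the pair whose weighted K-L divergence to their mean is smallest. For brevity the seeding here just takes the first M words rather than the M words of highest mutual information with the class, and all names and data are illustrative:

```python
import math

def kl(p, q):
    return sum(p[c] * math.log(p[c] / q[c]) for c in p if p[c] > 0)

def similarity_cost(a, b):
    """Weighted K-L-to-the-mean of two clusters a, b = (weight, class_distribution)."""
    (wa, pa), (wb, pb) = a, b
    t = wa + wb
    mean = {c: (wa / t) * pa[c] + (wb / t) * pb[c] for c in pa}
    return (wa / t) * kl(pa, mean) + (wb / t) * kl(pb, mean)

def merge(a, b):
    (wa, pa), (wb, pb) = a, b
    t = wa + wb
    return (t, {c: (wa / t) * pa[c] + (wb / t) * pb[c] for c in pa})

def cluster_words(word_dists, word_weights, M):
    """Greedy sketch: seed M clusters, then add one word at a time and merge
    the cheapest pair to get back down to M clusters.
    word_dists[w] = P(C | w); word_weights[w] = P(w)."""
    words = list(word_dists)
    clusters = [(word_weights[w], dict(word_dists[w])) for w in words[:M]]
    for w in words[M:]:
        clusters.append((word_weights[w], dict(word_dists[w])))
        # Find the cheapest pair to merge among the M+1 clusters
        pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: similarity_cost(clusters[ij[0]], clusters[ij[1]]))
        combined = merge(clusters[i], clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [combined]
    return clusters

word_dists = {
    "goal":   {"sports": 0.9, "politics": 0.1},
    "team":   {"sports": 0.8, "politics": 0.2},
    "vote":   {"sports": 0.1, "politics": 0.9},
    "senate": {"sports": 0.05, "politics": 0.95},
}
word_weights = {w: 0.25 for w in word_dists}
print(cluster_words(word_dists, word_weights, M=2))
```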

26 Related Work ChiMerge / Chi2 –Use distributional clustering to discretize numeric features Class-based clustering –Uses the reduction in mutual information to determine when to cluster –Not effective for text classification Feature selection by mutual information –Cannot capture dependencies between words Markov-blanket-based feature selection –Also attempts to preserve the shapes of P(C | w_t) Latent Semantic Indexing –Unsupervised, using PCA

27 The Experiment: Competitors to Distributional Clustering Clustering with LSI Information-gain-based feature selection Mutual-information feature selection Feature selection works by cutting out redundant features; clustering instead combines these redundancies.

28 The Experiment: Testbeds 20 Newsgroups –20,000 articles from 20 Usenet groups (approx. 62,000 words) ModApte "Reuters-21578" –9,603 training docs, 3,299 testing docs, 135 topics (approx. 16,000 words) Yahoo! Science (July 1997) –6,294 pages in 41 classes (approx. 44,000 words) –Very noisy data

29 20 Newsgroups Results Results are averaged over 5-20 trials. Computational constraints forced the Markov blanket method onto a smaller data set (second graph). LSI uses only a 1/3 training ratio.

30 20 Newsgroups Analysis Distributional Clustering achieves 82.1% accuracy with only 50 features, almost as good as using the full vocabulary. More accurate than all non-clustering approaches. LSI did not add any improvement to clustering (claim: because it is unsupervised). On the smaller data set, D.C. reaches 80% accuracy far more quickly than the others, in some cases doubling their performance for small numbers of features. Claim: clustering outperforms feature selection because it conserves information rather than discarding it.

31 Speed in the 20 Newsgroups Test Distributional Clustering: 7.5 minutes LSI: 23 minutes Markov Blanket: 10 hours Mutual-information feature selection (???): 30 seconds

32 Reuters-21578 Results D.C. outperforms the others for small numbers of features. Information-gain-based feature selection does better for larger feature sets. In this data set, documents can have multiple labels.

33 Yahoo! Results Feature selection performs almost as well or better in these cases Claim: The data is so noisy that it is actually beneficial to “lose data” via feature selection.

34 Performance Summary Only a slight loss in accuracy despite the reduction in feature space. Preserves "redundant" information better than feature selection. The improvement is not as drastic with noisy data.

35 Improvements on Earlier D.C. Work Does not show much improvement on sparse data, because the performance measure depends on the data distribution –D.C. preserves the class distributions even if those estimates are poor to begin with. Thus the whole method relies on accurate values for P(C | w_t).

36 Future Work Improve D.C.'s handling of sparse data (ensure good estimates of P(C | w_t)). Find ways to combine feature selection and D.C. to exploit the strengths of both (perhaps improving performance on noisy data sets?).

37 Some Thoughts Extremely supervised. Needs to be retrained when new documents come in. In a paper covering many topics, does Naïve Bayes (each word independent of context) make sense? Didn't work well on noisy data. How can we ensure proper values for θ?

