
Distributional Clustering of Words for Text Classification
Presentation by: Thomas Walsh (Rutgers University)
Paper by: L. Douglas Baker (Carnegie Mellon University) and Andrew Kachites McCallum (Justsystem Pittsburgh Research Center)

Clustering: Define what it means for words to be "similar", then "collapse" the word space by grouping similar words into "clusters". Key idea for Distributional Clustering: the class probabilities given the words in a labeled document collection, P(C | w), provide the rules for relating words to classifications.
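
To make the key quantity concrete, here is a minimal sketch of estimating P(C | w) from a labeled collection by counting word occurrences per class. This is not the authors' code; the toy corpus, labels, and function name are invented purely for illustration.

```python
from collections import Counter, defaultdict

def class_given_word(docs, labels):
    """Estimate P(C | w_t): for each word, the fraction of its
    occurrences that fall in documents of each class."""
    classes = sorted(set(labels))
    counts = defaultdict(Counter)            # word -> class -> occurrence count
    for doc, label in zip(docs, labels):
        for word in doc.split():
            counts[word][label] += 1
    dists = {}
    for word, per_class in counts.items():
        total = sum(per_class.values())
        dists[word] = {c: per_class[c] / total for c in classes}
    return dists

# Hypothetical toy corpus, for illustration only.
docs = ["the puck crossed the goal line", "the rocket reached orbit"]
labels = ["hockey", "space"]
print(class_given_word(docs, labels)["the"])   # 'the' is spread across both classes
print(class_given_word(docs, labels)["puck"])  # 'puck' is concentrated on 'hockey'
```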

Voting: The classification process can be understood through a voting model: each word in a document casts a weighted vote for the classification. Words that normally vote similarly can be clustered together and vote with the average of their weighted votes without negatively impacting performance.

Benefits of Word Clustering:
– Useful semantic word clusterings: automatically generates a kind of "Thesaurus".
– Higher classification accuracy (sort of; we'll discuss this in the results section).
– Smaller classification models: size reductions as dramatic as a factor of 50.

Benefits of Smaller Models: Easier to compute: with the constantly increasing amount of available text, reducing the memory footprint is crucial. Memory-constrained devices like PDAs could now use text classification algorithms to organize documents. More complex algorithms can be unleashed that would be infeasible in the full high-dimensional word space.

The Framework: Start with training data consisting of:
– A set of classes C = {c_1, c_2, …, c_m}
– A set of documents D = {d_1, …, d_n}
– Each document has a class label.

Mixture Models: f(x_i | θ) = Σ_k p_k · h(x_i | θ_k), where the p_k sum to 1 and h is a distribution function for x (such as a Gaussian) with θ_k as its parameter, i.e. (μ_k, σ_k) in the Gaussian case. Thus θ = (p_1, …, p_K, θ_1, …, θ_K).
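
As a concrete (and generic) illustration of this definition, not anything from the paper, a two-component Gaussian mixture density can be evaluated as follows; the weights and component parameters are made-up numbers.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """h(x | mu, sigma): density of a univariate Gaussian."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def mixture_density(x, weights, params):
    """f(x | theta) = sum_k p_k * h(x | theta_k); the weights must sum to 1."""
    return sum(p * gaussian_pdf(x, mu, sigma) for p, (mu, sigma) in zip(weights, params))

# Illustrative parameters only.
weights = [0.3, 0.7]
params = [(0.0, 1.0), (4.0, 2.0)]
print(mixture_density(1.5, weights, params))
```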

What is θ in this case? Assumption: there is a one-to-one correspondence between the mixture-model components and the classes. The class priors are contained in the vector θ and are estimated as the number of instances of each class divided by the total number of documents.

What is θ in this case? (continued) The rest of the entries in θ correspond to disjoint sets, one per class. The j-th set contains the probability of each word w_t in the vocabulary V given the class c_j. N(w_t, d_i) is the number of times word w_t appears in document d_i, and for the labeled training documents P(c_j | d_i) ∈ {0, 1}.

Probability of a Given Document under the Model: The mixture model can be used to produce documents with probability P(d_i | θ) = Σ_j P(c_j | θ) · P(d_i | c_j; θ): just the sum, over the classes, of the probability of generating this document from each class component.

Documents as Collections of Words: Treat each document as an ordered collection of word events, where d_ik is the word in document d_i at position k. In general, each word is dependent on the preceding words.

Apply the Naïve Bayes Assumption: Assume each word is independent of both context and position. Writing d_ik = w_t, this updates Formulas 2 and 1:
– (2) P(d_i | c_j; θ) = Π_k P(w_{d_ik} | c_j; θ)
– (1) P(d_i | θ) = Σ_j P(c_j | θ) · Π_k P(w_{d_ik} | c_j; θ)

Incorporating the Expanded Formulae for θ: We can estimate the model parameters θ from the training data. Now we wish to calculate P(c_j | d_i; θ), the probability of document d_i belonging to class c_j.
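
A minimal sketch of this estimation step, assuming the usual Laplace-smoothed counting estimates for multinomial Naïve Bayes (the function name and toy data are invented; the paper's exact smoothing may differ):

```python
from collections import Counter

def estimate_parameters(docs, labels):
    """Estimate class priors P(c_j | theta) and word probabilities
    P(w_t | c_j; theta) with add-one (Laplace) smoothing."""
    classes = sorted(set(labels))
    vocab = {w for doc in docs for w in doc.split()}
    priors = {c: labels.count(c) / len(labels) for c in classes}

    word_counts = {c: Counter() for c in classes}   # sums of N(w_t, d_i) per class
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())

    word_probs = {}
    for c in classes:
        total = sum(word_counts[c].values())
        word_probs[c] = {w: (1 + word_counts[c][w]) / (len(vocab) + total)
                         for w in vocab}
    return priors, word_probs, vocab
```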

Final Equation: P(c_j | d_i; θ) = [P(c_j | θ) · Π_k P(w_{d_ik} | c_j; θ)] / [Σ_r P(c_r | θ) · Π_k P(w_{d_ik} | c_r; θ)]. The numerator is the class prior times the product (Formula 2) of the probabilities of each word in the document given class c_j; the denominator sums the same quantity over all classes c_r (combining the earlier formulas via Bayes rule). The class c_j that maximizes this value is assigned to the document.
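
Putting the pieces together, a sketch of classification with these parameters. It reuses the hypothetical estimate_parameters helper sketched above and works in log space to avoid underflow, which is a common implementation choice rather than anything prescribed by the paper.

```python
import math

def classify(doc, priors, word_probs, vocab):
    """Return the class maximizing P(c_j | d_i; theta). The denominator is
    the same for every class, so only the numerator is compared, in log space."""
    scores = {}
    for c, prior in priors.items():
        score = math.log(prior)
        for word in doc.split():
            if word in vocab:                 # ignore out-of-vocabulary words
                score += math.log(word_probs[c][word])
        scores[c] = score
    return max(scores, key=scores.get)

# Toy usage with the hypothetical corpus from the earlier sketches.
docs = ["the puck crossed the goal line", "the rocket reached orbit"]
labels = ["hockey", "space"]
priors, word_probs, vocab = estimate_parameters(docs, labels)
print(classify("rocket launch into orbit", priors, word_probs, vocab))  # -> space
```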

Shortcomings of the Framework: In real-world data (documents) there isn't actually an underlying mixture model, and the independence assumption doesn't actually hold. But empirical evidence and some theoretical work (Domingos and Pazzani 1997) indicate that the damage from this is negligible.

What about clustering? So assuming the Framework holds… how does clustering fit into all this?

How Does Clustering Affect the Probabilities? The class distribution of a merged cluster is the weighted average of its members' distributions: the fraction of the cluster's occurrences coming from w_t times P(C | w_t), plus the fraction coming from w_s times P(C | w_s), i.e. P(C | w_t ∨ w_s) = [P(w_t) · P(C | w_t) + P(w_s) · P(C | w_s)] / (P(w_t) + P(w_s)).
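
A small sketch of that merge rule, assuming the word priors P(w_t) and P(w_s) are estimated from corpus counts; the function name and numbers are illustrative.

```python
def merge_distributions(p_w_t, p_w_s, dist_t, dist_s):
    """Class distribution of the merged cluster (w_t or w_s): a weighted
    average of the two class distributions, weighted by the word priors."""
    total = p_w_t + p_w_s
    return {c: (p_w_t * dist_t[c] + p_w_s * dist_s[c]) / total for c in dist_t}

# Illustrative numbers only.
dist_t = {"hockey": 0.9, "space": 0.1}
dist_s = {"hockey": 0.8, "space": 0.2}
print(merge_distributions(0.02, 0.01, dist_t, dist_s))   # weighted toward dist_t
```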

Vs. Other Forms of Learning: Distributional Clustering measures similarity based on the property it is trying to estimate (the classes), which makes the supervision in the training data really important. Clustering is based on the similarity of the class-variable distributions. Key idea: clustering preserves the "shape" of the class distributions.

Kullback-Leibler Divergence: Measures the similarity (strictly, the dissimilarity) between class distributions: D(P(C | w_t) || P(C | w_s)) = Σ_j P(c_j | w_t) · log[ P(c_j | w_t) / P(c_j | w_s) ]. If P(c_j | w_t) = P(c_j | w_s) for every class, then each term contributes log(1) = 0.

Problems with K-L Divergence: It is not symmetric, and the denominator P(c_j | w_s) can be 0 if w_s does not appear in any documents of class c_j, leaving the divergence undefined.

K-L Divergence from the Mean: Weight the K-L divergence of each word (to the merged cluster's distribution) by the fraction of the cluster's occurrences contributed by that word:
(P(w_t) / (P(w_t) + P(w_s))) · D(P(C | w_t) || P(C | w_t ∨ w_s)) + (P(w_s) / (P(w_t) + P(w_s))) · D(P(C | w_s) || P(C | w_t ∨ w_s)).
New and improved: this uses a weighted average rather than just the simple mean. Justification: it fits clustering because the independent distributions now form combined statistics; and since the merged distribution has support wherever either word's distribution does, the zero-denominator problem above disappears.
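
A sketch of this similarity measure (smaller values mean the two words can be merged with less distortion). It reuses the hypothetical merge_distributions helper from the earlier sketch; the numbers are illustrative.

```python
import math

def kl(p, q):
    """D(p || q) = sum_j p_j * log(p_j / q_j); zero-probability terms of p are skipped."""
    return sum(p[c] * math.log(p[c] / q[c]) for c in p if p[c] > 0)

def kl_to_the_mean(p_w_t, p_w_s, dist_t, dist_s):
    """Weighted K-L divergence of each word's class distribution to the
    merged cluster's distribution."""
    merged = merge_distributions(p_w_t, p_w_s, dist_t, dist_s)
    total = p_w_t + p_w_s
    return (p_w_t / total) * kl(dist_t, merged) + (p_w_s / total) * kl(dist_s, merged)

# Two words with similar class distributions score near zero.
print(kl_to_the_mean(0.02, 0.01,
                     {"hockey": 0.9, "space": 0.1},
                     {"hockey": 0.8, "space": 0.2}))
```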

Minimizing Error in Naïve Bayes Scores: Assuming uniform class priors allows us to drop P(c_j | θ) and the whole denominator from Equation (6). A little algebra then yields the cross entropy, so the error can be measured as the change in cross entropy caused by clustering. Minimizing this quantity results in Equation (9), so clustering by this method minimizes the classification error.
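
A sketch of the algebra being referenced, reconstructed here since the slide's equations did not survive the transcript; N(w_t, d_i) is the count of w_t in d_i and |d_i| the document length.

```latex
% With uniform class priors, classifying d_i reduces to maximizing over c_j:
\log P(d_i \mid c_j; \theta) = \sum_t N(w_t, d_i) \, \log P(w_t \mid c_j; \theta)
% Dividing by |d_i| gives the negative cross entropy between the document's
% empirical word distribution and the class-conditional word distribution:
-H\big(\hat{P}(W \mid d_i),\ P(W \mid c_j; \theta)\big)
  = \sum_t \frac{N(w_t, d_i)}{|d_i|} \, \log P(w_t \mid c_j; \theta)
```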

The Clustering Algorithm: Comparing the similarity of all possible word clusters would be O(V²). Instead, a number M is fixed in advance as the total number of desired clusters (more supervision). The M clusters are initialized with the M words that have the highest mutual information with the class variable. Properties: greedy, and scales efficiently.

Algorithm (figure): the greedy agglomerative procedure, operating on the class distributions P(C | w_t).
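
Since the algorithm figure itself did not survive the transcript, here is a rough sketch of the greedy procedure as I read it from the surrounding slides (not the authors' code). It reuses the hypothetical kl_to_the_mean and merge_distributions helpers above, and assumes the words are already sorted by mutual information with the class variable.

```python
from itertools import combinations

def cluster_words(word_priors, word_dists, M):
    """Greedy agglomerative clustering of words into M clusters.
    word_priors: {word: P(w)}; word_dists: {word: P(C | w)} as dicts.
    Words are assumed to be ordered by mutual information with the class."""
    words = list(word_priors)
    clusters = [{"words": [w], "prior": word_priors[w], "dist": word_dists[w]}
                for w in words[:M]]
    for w in words[M:]:
        # Add the next word as an (M+1)-th singleton cluster ...
        clusters.append({"words": [w], "prior": word_priors[w], "dist": word_dists[w]})
        # ... then merge the two most similar clusters to get back to M clusters.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: kl_to_the_mean(clusters[ij[0]]["prior"],
                                                 clusters[ij[1]]["prior"],
                                                 clusters[ij[0]]["dist"],
                                                 clusters[ij[1]]["dist"]))
        merged = {"words": clusters[i]["words"] + clusters[j]["words"],
                  "prior": clusters[i]["prior"] + clusters[j]["prior"],
                  "dist": merge_distributions(clusters[i]["prior"], clusters[j]["prior"],
                                              clusters[i]["dist"], clusters[j]["dist"])}
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters
```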

Related Work:
– ChiMerge / Chi2: use distributional clustering to discretize numeric values.
– Class-based clustering: uses the amount by which mutual information is reduced to determine when to cluster; not effective in text classification.
– Feature selection by mutual information: cannot capture dependencies between words.
– Markov-blanket-based feature selection: also attempts to preserve the shapes of P(C | w_t).
– Latent Semantic Indexing: unsupervised, using PCA.

The Experiment: Competitors to Distributional Clustering:
– Clustering with LSI
– Information-gain-based feature selection
– Mutual-information feature selection
Feature selection involves cutting out redundant features; clustering instead combines these redundancies.

The Experiment: Testbeds:
– 20 Newsgroups: 20,000 articles from 20 Usenet groups (apx words).
– ModApte "Reuters-21578": 9,603 training docs, 3,299 testing docs, 135 topics (apx words).
– Yahoo! Science (July 1997): 6,294 pages in 41 classes (apx words); very noisy data.

20 Newsgroups Results: Results are averaged over 5-20 trials. Computational constraints forced the Markov-blanket method onto a smaller data set (second graph). LSI uses only a 1/3 training ratio.

20 Newsgroups Analysis: Distributional Clustering achieves 82.1% accuracy at 50 features, almost as good as having the full vocabulary, and is more accurate than all non-clustering approaches. LSI did not add any improvement to clustering (claim: because it is unsupervised). On the smaller data set, D.C. achieves 80% accuracy far more quickly than the others, in some cases doubling their performance for small numbers of features. Claim: clustering outperforms feature selection because it conserves information rather than discarding it.

Speed in the 20 Newsgroups Test:
– Distributional Clustering: 7.5 minutes
– LSI: 23 minutes
– Markov blanket: 10 hours
– Mutual-information feature selection (???): 30 seconds

Reuters Results: D.C. outperforms the others for small numbers of features, but information-gain-based feature selection does better for larger feature sets. In this data set, documents can have multiple labels.

Yahoo! Results: Feature selection performs almost as well as or better than clustering in these cases. Claim: the data is so noisy that it is actually beneficial to "lose data" via feature selection.

Performance Summary: Only a slight loss in accuracy despite the reduction in feature space. Preserves "redundant" information better than feature selection. The improvement is not as drastic with noisy data.

Improvements on Earlier D.C. Work: It does not show much improvement on sparse data, because the performance measure is tied to the data distribution: D.C. preserves the class distributions even if these are poor estimates to begin with. Thus the whole method relies on accurate values for P(C | w_t).

Future Work: Improve D.C.'s handling of sparse data (ensure good estimates of P(C | w_t)). Find ways to combine feature selection and D.C. to exploit the strengths of both (perhaps increasing performance on noisy data sets?).

Some Thoughts:
– Extremely supervised.
– Needs to be retrained when new documents come in.
– In a document with a lot of topics, does Naïve Bayes (each word independent of context) make sense?
– Didn't work well on noisy data.
– How can we ensure proper θ values?