Optimizing Text Classification Mark Trenorden Supervisor: Geoff Webb
Introduction
– What is Text Classification?
– Naïve Bayes
– Event Models
– Binomial Model
– Binning
– Conclusion
Text Classification
– Grouping documents of the same topic, for example Sport, Politics, etc.
– A slow process for humans
Naïve Bayes
– $P(c_j \mid d_i) = \dfrac{P(c_j)\, P(d_i \mid c_j)}{P(d_i)}$
– This is Bayes' theorem
– Naïve Bayes assumes independence between attributes, in this case words
– The assumption is not correct, yet Naïve Bayes still classifies well
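As a small illustration of Bayes' theorem for classification, the sketch below turns class priors and document likelihoods into posteriors. The function name `posterior`, the class names and all numbers are made up for the example and are not taken from the experiments.

```python
def posterior(priors, likelihoods):
    """Return P(c | d) for every class c via Bayes' theorem.

    priors:      {class: P(c)}
    likelihoods: {class: P(d | c)}
    """
    # P(d) is the sum over all classes of P(c) * P(d | c).
    evidence = sum(priors[c] * likelihoods[c] for c in priors)
    return {c: priors[c] * likelihoods[c] / evidence for c in priors}


# Illustrative values only (assumed, not from the slides).
priors = {"Sport": 0.5, "Politics": 0.5}
likelihoods = {"Sport": 0.002, "Politics": 0.006}
result = posterior(priors, likelihoods)
```

Because P(d) is the same for every class, a classifier only needs to compare P(c) P(d | c) across classes; the division just normalises the scores so they sum to one.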
Event Models
– Different ways of viewing a document
– In Bayes' rule this translates to different ways of calculating $P(d_i \mid c_j)$
– There are two frequently used models
Multi-Variate Bernoulli Model
In text classification terms,
– A document ($d_i$) is an EVENT
– Words ($w_t$) within the document are considered ATTRIBUTES of $d_i$
– The number of occurrences of a word in a document is not recorded
– When calculating the probability of class membership, all words in the vocabulary are considered, even if they don't appear in the document
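A minimal sketch of the document likelihood under this model, assuming the per-class word probabilities have already been estimated and smoothed so that none is exactly 0 or 1. The function name and the example probabilities are hypothetical.

```python
import math


def bernoulli_log_likelihood(doc_words, word_probs):
    """log P(d | c) under the multi-variate Bernoulli model.

    doc_words:  set of words present in the document (counts are ignored)
    word_probs: {word: P(word appears in a document | c)} for the whole vocabulary
    """
    total = 0.0
    # Every vocabulary word contributes, whether or not it is in the document.
    for word, p in word_probs.items():
        if word in doc_words:
            total += math.log(p)        # word is present
        else:
            total += math.log(1.0 - p)  # word is absent
    return total


# Illustrative probabilities only (assumed).
word_probs = {"bush": 0.5, "george": 0.25, "cat": 0.5}
ll = bernoulli_log_likelihood({"bush", "cat"}, word_probs)
```

The loop runs over the vocabulary rather than over the document, which is why the cost of this model grows with vocabulary size.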
Multinomial Model
– The number of occurrences of a word is captured
– Individual word occurrences are considered as "events"
– The document is considered to be a collection of events
– Only words that appear in the document, and their counts, are considered when calculating class membership
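A minimal sketch of the multinomial document likelihood, assuming smoothed per-class word probabilities are available. The function name and example values are hypothetical; the multinomial coefficient is left out because it depends only on the document and so is identical for every class.

```python
import math
from collections import Counter


def multinomial_log_likelihood(tokens, word_probs):
    """log P(d | c) under the multinomial model, up to a class-independent constant.

    tokens:     list of word occurrences in the document
    word_probs: {word: P(word | c)}, assumed to sum to 1 over the vocabulary
    """
    counts = Counter(tokens)
    # Only words that occur in the document contribute, weighted by their counts.
    return sum(n * math.log(word_probs[w]) for w, n in counts.items())


# Illustrative probabilities only (assumed).
word_probs = {"bush": 0.5, "george": 0.25, "cat": 0.25}
ll = multinomial_log_likelihood(["bush", "bush", "george"], word_probs)
```

Here the loop runs over the words in the document only, which is consistent with the multinomial model being faster on large vocabularies.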
Previous Comparison
– Multi-Variate model good for small vocabularies
– Multinomial model good for large vocabularies
– Multinomial much faster than the Multi-Variate
Binomial Model
– Want to capture occurrences and non-occurrences as well as word frequencies
– $P(d_i \mid c_j) = \text{Sum of } P(c) + P(w \mid d)^{N} \cdot P(\lnot w \mid d)^{L-N}$
– Where c = class, w = word, d = document, L = document length and N = number of occurrences of the word
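The slide formula is terse, so the sketch below is only one possible reading of it, not necessarily the implementation used in the experiments. It assumes that the word probabilities are per-class estimates P(w | c), that "Sum of" means a sum over vocabulary words taken in log space, and that each word contributes a binomial-style term with N occurrences out of L positions. The function name and numbers are hypothetical.

```python
import math
from collections import Counter


def binomial_log_score(tokens, prior, word_probs):
    """Assumed reading: log P(c) + sum over words of log(p^N * (1 - p)^(L - N)).

    tokens:     list of word occurrences in the document
    prior:      P(c)
    word_probs: {word: assumed per-class probability p of the word}, 0 < p < 1
    """
    length = len(tokens)            # L, the document length
    counts = Counter(tokens)
    score = math.log(prior)
    for word, p in word_probs.items():
        n = counts[word]            # N, zero if the word does not occur
        # Occurrences and non-occurrences both contribute.
        score += n * math.log(p) + (length - n) * math.log(1.0 - p)
    return score


# Illustrative values only (assumed).
score = binomial_log_score(["bush", "bush", "cat"], 0.5, {"bush": 0.5, "cat": 0.25})
```

Like the multi-variate Bernoulli model, this reading visits every vocabulary word for every document, which would be consistent with the slower running time reported.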
Binomial Results
– Performed just as well as the multinomial model with a large vocabulary, but was much slower
– Outperformed Multi-Variate once the vocabulary increased
– However, did worse than existing techniques with smaller vocabulary sizes
Binomial Results
[Figure: % correctly classed versus number of words in the vocabulary]
Document Length
– None of the techniques take document length into account
– Currently, $P(d \mid c) = f(w \in d, c)$
– However, we should incorporate document length: $P(d \mid c) = f(w \in d, l, c)$
Binning
– Discretization has been found to be effective for numeric variables in Naïve Bayes
– Binning groups documents of similar lengths
– The theory is that the distributions will differ significantly for different lengths
– This will help improve classification
Binning
– For my tests, bin size = 1000; if there are fewer than 2000 documents, only two bins are used
– [Figure: Bin 1, Bin 2, Bin 3, Bin 4 in order of increasing document size]
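One way the bin assignment could be sketched is shown below. It assumes that the bin size counts documents, that documents are ranked by length, and that any remainder is spread evenly across bins; none of these details are stated on the slide, and the function name `assign_bins` is hypothetical.

```python
def assign_bins(doc_lengths, bin_size=1000):
    """Return a bin index per document, with bins ordered by increasing length."""
    n_docs = len(doc_lengths)
    # With fewer than 2 * bin_size documents, fall back to two bins.
    n_bins = max(2, n_docs // bin_size)
    # Rank documents from shortest to longest.
    order = sorted(range(n_docs), key=lambda i: doc_lengths[i])
    bins = [0] * n_docs
    for rank, doc_index in enumerate(order):
        # Equal-frequency split: rank r goes to bin floor(r * n_bins / n_docs).
        bins[doc_index] = rank * n_bins // n_docs
    return bins


small = assign_bins([50, 10, 40, 20])       # fewer than 2000 documents: two bins
large = assign_bins(list(range(3000)))      # 3000 documents: three bins of 1000
```

An equal-frequency split keeps every bin populated, so each per-bin word table is estimated from a comparable number of documents.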
Binning Example
– Two bins are created
– A table of word counts for each class is created within each bin, as opposed to one table for all words as in traditional methods
– [Tables: one per length bin; columns George, Bush, Cat; rows GWB, Not GWB. Extracted cell values: 4/20, 7/20, 3/20, 1/20, 3/20, 2/20, 3/20, 2/20, 3/20, 7/20]
Binning
– Given an unseen document, binning helps refine probabilities
– For example, with no bins, the probability that the word 'Bush' occurs in the GWB class is 10/40, or 25%
– If we know the document's length bin, the probability can be taken from that bin's table: here the probability of 'Bush' appearing in GWB is 7/20, or 35%
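The arithmetic in this example can be checked with a few lines. The 7/20 and 10/40 figures come from the slide; the 3/20 for the other bin is inferred from those two figures, and the bin names `bin_a` and `bin_b` are placeholders.

```python
from fractions import Fraction

# (occurrences of 'Bush' in class GWB, total word count in class GWB) per length bin.
# bin_b is inferred: 10/40 pooled minus 7/20 in bin_a leaves 3/20.
bush_in_gwb = {"bin_a": (7, 20), "bin_b": (3, 20)}

# Without binning, counts from all bins are pooled into one table.
pooled = Fraction(
    sum(n for n, _ in bush_in_gwb.values()),
    sum(total for _, total in bush_in_gwb.values()),
)

# With binning, each bin keeps its own estimate.
per_bin = {name: Fraction(n, total) for name, (n, total) in bush_in_gwb.items()}
```

The pooled estimate is the average of the two per-bin estimates here because both bins hold the same number of words; with unequal bins it would be a weighted average.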
Binning Results
– When applied to all datasets, binning improved classification accuracy for all techniques
Binning Results 7 Sectors Dataset, Multi-Variate Method
Binning Results WebKB Dataset, Multinomial Method
Conclusion/Future Goals
– Binning best solution
– Applicable to all event models
– In future, apply event models and binning techniques to classification techniques other than Naïve Bayes