A COMPARISON OF ANN, NAÏVE BAYES, AND DECISION TREE FOR THE PURPOSE OF SPAM FILTERING KAASHYAPEE JHA ECE/CS 539
NAÏVE BAYES CLASSIFIER Bayes Theorem:
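For reference, Bayes' theorem as it is commonly applied to spam filtering can be written as below. The symbols are notation assumed here rather than taken from the slides: S stands for the spam class and w_1, ..., w_n for the words of a message. The second line adds the "naïve" conditional-independence assumption that gives the classifier its name.

```latex
P(S \mid w) = \frac{P(w \mid S)\, P(S)}{P(w)}

P(S \mid w_1, \dots, w_n) \;\propto\; P(S) \prod_{i=1}^{n} P(w_i \mid S)
```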
PREPROCESSING Stop list: ignore trivial words such as {or, and, but, a, an, the, is, in, for}. Also ignore words that are very uncommon.
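The two preprocessing steps can be sketched in Python as follows. This is a minimal illustration, not the project's actual code: the stop list is the one given on the slide, while the `MIN_COUNT` cutoff for "very uncommon", the tokenizer, and the sample documents are assumptions made for the example.

```python
from collections import Counter

# Stop list taken from the slide.
STOP_WORDS = {"or", "and", "but", "a", "an", "the", "is", "in", "for"}
# Hypothetical cutoff: words seen fewer than this many times count as "very uncommon".
MIN_COUNT = 2

def tokenize(text):
    """Lowercase the text, split on whitespace, and strip surrounding punctuation."""
    words = (w.strip(".,!?;:\"'()") for w in text.lower().split())
    return [w for w in words if w]

def build_vocabulary(documents, min_count=MIN_COUNT):
    """Keep words that are not stop words and occur at least min_count times overall."""
    counts = Counter(w for doc in documents for w in tokenize(doc))
    return {w for w, c in counts.items() if c >= min_count and w not in STOP_WORDS}

def preprocess(text, vocabulary):
    """Return the tokens of text that survive the vocabulary filter."""
    return [w for w in tokenize(text) if w in vocabulary]

# Invented sample documents, for illustration only.
docs = [
    "Win a free prize in the lottery",
    "The meeting is in the morning",
    "Free lottery tickets for the meeting",
]
vocab = build_vocabulary(docs)
print(sorted(vocab))
print(preprocess("A free meeting for the xylophone club", vocab))
```

Building the vocabulary once from the training documents and reusing it for the testing documents keeps both sets in the same feature space.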
NAÏVE BAYES CLASSIFIER RESULTS
Columns: # of spam documents, # of ham documents, False positive rate, Accuracy
Trial #1: Training Set, accuracy 98.9%; Testing Set 82476
Trial #2: Training Set, accuracy 98.1%; Testing Set 68451
Trial #3: Training Set, accuracy 96.9%; Testing Set 57232
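The two reported metrics can be computed from a classifier's predictions roughly as follows. The sketch assumes the usual spam-filtering convention that a false positive is a ham message classified as spam, and that the false positive rate is taken over all ham messages; the slides do not state the definition. The labels are made up for illustration and are not the trial data.

```python
def evaluate(true_labels, predicted_labels):
    """Return (false_positive_rate, accuracy) for parallel lists of 'spam'/'ham' labels."""
    pairs = list(zip(true_labels, predicted_labels))
    # Assumed definition: a false positive is ham that was flagged as spam.
    false_pos = sum(1 for t, p in pairs if t == "ham" and p == "spam")
    ham_total = sum(1 for t, _ in pairs if t == "ham")
    correct = sum(1 for t, p in pairs if t == p)
    fpr = false_pos / ham_total if ham_total else 0.0
    accuracy = correct / len(pairs) if pairs else 0.0
    return fpr, accuracy

# Made-up labels purely for illustration.
truth = ["spam", "spam", "ham", "ham", "ham", "ham"]
preds = ["spam", "ham", "ham", "spam", "ham", "ham"]
fpr, acc = evaluate(truth, preds)
print(f"false positive rate = {fpr:.2%}, accuracy = {acc:.2%}")
```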
SVM RESULTS
Columns: # of spam documents, # of ham documents, False positive rate, Accuracy
Trial #1: Training Set, accuracy 99.6%; Testing Set 82476
Trial #2: Training Set, accuracy 99.3%; Testing Set 68451
Trial #3: Training Set, accuracy 97.6%; Testing Set 57232
WEAKNESS OF NAÏVE BAYES CLASSIFIER Example: "hey man are you interested in sports? then me at" Spammers can evade the filter by avoiding words that are more likely to appear in spam.
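The weakness can be illustrated with a small word-count Naïve Bayes sketch. Everything here is an assumption made for the example: the toy training messages are invented, add-one (Laplace) smoothing is used, and words never seen in training are ignored. It is not the classifier evaluated in the trials. A message written in casual wording receives a low spam probability because none of its words are associated with spam.

```python
import math
from collections import Counter

def train(messages):
    """messages: list of (label, list_of_words). Returns per-class word counts and class counts."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    class_counts = Counter()
    for label, words in messages:
        class_counts[label] += 1
        word_counts[label].update(words)
    return word_counts, class_counts

def spam_probability(words, word_counts, class_counts):
    """Posterior P(spam | words) with add-one smoothing, computed in log space."""
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    total_docs = sum(class_counts.values())
    log_scores = {}
    for label in ("spam", "ham"):
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_docs)  # class prior
        for w in words:
            if w in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        log_scores[label] = score
    # Normalise the two log scores into a probability.
    top = max(log_scores.values())
    exp = {k: math.exp(v - top) for k, v in log_scores.items()}
    return exp["spam"] / (exp["spam"] + exp["ham"])

# Toy training data, invented for illustration only.
training = [
    ("spam", ["free", "money", "click", "now"]),
    ("spam", ["win", "free", "prize", "click"]),
    ("ham", ["are", "you", "interested", "sports"]),
    ("ham", ["hey", "man", "lunch", "tomorrow"]),
]
wc, cc = train(training)
obvious = spam_probability(["free", "money", "click"], wc, cc)
evasive = spam_probability(["hey", "man", "interested", "sports"], wc, cc)
print(f"obvious spam wording: {obvious:.3f}")
print(f"casual wording:       {evasive:.3f}")
```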
WORK AHEAD Finish implementing and testing the Decision Tree. Do more preprocessing of the data. Perform more trials with different ratios of training set to testing set.
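Trials with different training/testing ratios could be set up along these lines. This is only a sketch: the `split` helper, the fixed seed, the placeholder corpus, and the ratios 50%, 70%, and 90% are all hypothetical choices, not values from the slides.

```python
import random

def split(documents, train_fraction, seed=0):
    """Shuffle a copy of documents and split it into (training, testing) lists."""
    shuffled = list(documents)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps trials repeatable
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

# Placeholder corpus of 100 labelled items; a real run would use the email data.
corpus = [("spam" if i % 2 else "ham", f"message {i}") for i in range(100)]
for fraction in (0.5, 0.7, 0.9):  # hypothetical ratios
    train_set, test_set = split(corpus, fraction)
    print(f"{fraction:.0%} training: {len(train_set)} train, {len(test_set)} test")
```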